in the comprehensible framework of discrete mathematics. In Section 2, two prominent
estimation methods, the relative-frequency estimation and the maximum-likelihood estimation,
are presented. Section 3 is dedicated to the expectation-maximization algorithm and a
simpler variant, the generalized expectation-maximization algorithm. In Section 4, two loaded
dice are rolled. A more interesting example is presented in Section 5: the estimation of
probabilistic context-free grammars. Enjoy!



2 Estimation Methods

A statistics problem is a problem in which a corpus^1 that has been generated in accordance
with some unknown probability distribution must be analyzed and some type of inference
about the unknown distribution must be made. In other words, in a statistics problem there is
a choice between two or more probability distributions which might have generated the corpus.
In practice, there are often an infinite number of different possible distributions - statisticians
bundle these into one single probability model - which might have generated the corpus. By
analyzing the corpus, an attempt is made to learn about the unknown distribution. So, on the
basis of the corpus, an estimation method selects one instance of the probability model,
thereby aiming at finding the original distribution. In this section, two common estimation
methods, the relative-frequency and the maximum-likelihood estimation, are presented.

Corpora 

Definition 1 Let $\mathcal{X}$ be a countable set. A real-valued function $f\colon \mathcal{X} \to \mathbb{R}$ is called a corpus,
if $f$'s values are non-negative numbers

$$f(x) \ge 0 \quad \text{for all } x \in \mathcal{X}$$

Each $x \in \mathcal{X}$ is called a type, and each value of $f$ is called a type frequency. The corpus
size^2 is defined as

$$|f| = \sum_{x \in \mathcal{X}} f(x)$$

Finally, a corpus is called non-empty and finite if

$$0 < |f| < \infty$$

^1 Statisticians use the term sample but computational linguists prefer the term corpus.

In this definition, type frequencies are defined as non-negative real numbers. The reason for 
not taking natural numbers is that some statistical estimation methods define type frequencies 
as weighted occurrence frequencies (which are not natural but non-negative real numbers).
Later on, in the context of the EM algorithm, this point will become clear. Note also that
a finite corpus might consist of an infinite number of types with positive frequencies. The
following definition shows that Definition 1 covers the standard notion of the term corpus
(used in Computational Linguistics) and of the term sample (used in Statistics).

Definition 2 Let $x_1, \dots, x_n$ be a finite sequence of type instances from $\mathcal{X}$. Each $x_i$ of this
sequence is called a token. The occurrence frequency of a type $x$ in the sequence is defined
as the following count

$$f(x) = |\{\, i \mid x_i = x \,\}|$$

Obviously, $f$ is a corpus in the sense of Definition 1, and it has the following properties: The
type $x$ does not occur in the sequence if $f(x) = 0$; in any other case there are $f(x)$ tokens
in the sequence which are identical to $x$. Moreover, the corpus size $|f|$ is identical to $n$, the
number of tokens in the sequence.
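As a small illustration of Definitions 1 and 2, the following Python sketch counts the tokens of a sequence and sums the counts to obtain the corpus size. The function names and the token sequence are hypothetical and chosen only for this example.

```python
from collections import Counter

def corpus_from_tokens(tokens):
    """Occurrence-frequency corpus f(x) = |{i : x_i = x}| of Definition 2."""
    return Counter(tokens)

def corpus_size(f):
    """Corpus size |f|: the sum of all type frequencies."""
    return sum(f.values())

tokens = ["a", "b", "a", "c", "b", "a"]   # hypothetical token sequence
f = corpus_from_tokens(tokens)
print(dict(f), corpus_size(f))
```

A `Counter` returns 0 for types that do not occur in the sequence, which matches the convention that $f(x) = 0$ for unseen types.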



Relative-Frequency Estimation

Let us first present the notion of probability that we use throughout this paper. 

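Definition 3 Let $\mathcal{X}$ be a countable set of types. A real-valued function $p\colon \mathcal{X} \to \mathbb{R}$ is called a
probability distribution on $\mathcal{X}$, if $p$ has two properties: First, $p$'s values are non-negative
numbers

$$p(x) \ge 0 \quad \text{for all } x \in \mathcal{X}$$

and second, $p$'s values sum to 1

$$\sum_{x \in \mathcal{X}} p(x) = 1$$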

Readers familiar with probability theory will certainly note that we use the term probability
distribution in a sloppy way (Duda et al. (2001), page 611, introduce the term probability
mass function instead). Standardly, probability distributions allocate a probability value
$p(A)$ to subsets $A \subseteq \mathcal{X}$, so-called events of an event space $\mathcal{X}$, such that three specific
axioms are satisfied (see e.g. DeGroot (1989)):

Axiom 1 $p(A) \ge 0$ for any event $A$.

Axiom 2 $p(\mathcal{X}) = 1$.

Axiom 3 $p\left(\bigcup_{i=1}^{\infty} A_i\right) = \sum_{i=1}^{\infty} p(A_i)$ for any infinite sequence of disjoint events $A_1, A_2, A_3, \dots$

^2 Note that the corpus size $|f|$ is well-defined: The order of summation is not relevant for the value of the
(possibly infinite) series $\sum_{x \in \mathcal{X}} f(x)$, since the types are countable and the type frequencies are non-negative
numbers.

Figure 1: Maximum-likelihood estimation and relative-frequency estimation

Now, however, note that the probability distributions introduced in Definition 3 induce rather
naturally the following probabilities for events $A \subseteq \mathcal{X}$

$$p(A) := \sum_{x \in A} p(x)$$

Using the properties of $p(x)$, we can easily show that the probabilities $p(A)$ satisfy the three
axioms of probability theory. So, Definition 3 is justified and thus, for the rest of the paper,
we are allowed to put axiomatic probability theory out of our minds.

Definition 4 Let $f$ be a non-empty and finite corpus. The probability distribution

$$\tilde{p}\colon \mathcal{X} \to [0,1] \quad \text{where} \quad \tilde{p}(x) = \frac{f(x)}{|f|}$$

is called the relative-frequency estimate on $f$.

The relative-frequency estimation is the most comprehensible estimation method and has
some nice properties which will be discussed in the context of the more general
maximum-likelihood estimation. For now, however, note that $\tilde{p}$ is well defined, since both $|f| > 0$
and $|f| < \infty$. Moreover, it is easy to check that $\tilde{p}$'s values sum to one:

$$\sum_{x \in \mathcal{X}} \tilde{p}(x) = \sum_{x \in \mathcal{X}} |f|^{-1} \cdot f(x) = |f|^{-1} \cdot \sum_{x \in \mathcal{X}} f(x) = |f|^{-1} \cdot |f| = 1$$
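Definition 4 can be sketched in a few lines of Python. The function name and the example corpus are hypothetical; the corpus deliberately uses fractional type frequencies, which Definition 1 permits.

```python
def relative_frequency_estimate(f):
    """Relative-frequency estimate p~(x) = f(x) / |f| on a finite, non-empty corpus."""
    size = sum(f.values())
    if not 0 < size < float("inf"):
        raise ValueError("corpus must be non-empty and finite")
    return {x: fx / size for x, fx in f.items()}

f = {"a": 1.5, "b": 0.5, "c": 2.0}   # hypothetical corpus with weighted frequencies
p_tilde = relative_frequency_estimate(f)
print(p_tilde, sum(p_tilde.values()))
```

The explicit size check mirrors the two conditions $|f| > 0$ and $|f| < \infty$ that make the estimate well defined.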

Maximum-Likelihood Estimation 

Maximum-likelihood estimation was introduced by R. A. Fisher in 1912, and will typically
yield an excellent estimate if the given corpus is large. Most notably, maximum-likelihood
estimators fulfill the so-called invariance principle and, under certain conditions which are
typically satisfied in practical problems, they are even consistent estimators (DeGroot 1989).
For these reasons, maximum-likelihood estimation is probably the most widely used
estimation method.

Now, unlike relative-frequency estimation, maximum-likelihood estimation is a fully-fledged
estimation method that aims at selecting an instance of a given probability model which
might have originally generated the given corpus. By contrast, the relative-frequency estimate
is defined on the basis of a corpus only (see Definition 4). Figure 1 reveals the conceptual
difference between the two estimation methods. In what follows, we will pay some attention to
describing the single setting in which we are exceptionally allowed to mix up both methods (see
Theorem 1). Let us start, however, by presenting the notion of a probability model.

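Definition 5 A non-empty set $\mathcal{M}$ of probability distributions on a set $\mathcal{X}$ of types is called a
probability model on $\mathcal{X}$. The elements of $\mathcal{M}$ are called instances of the model $\mathcal{M}$. The
unrestricted probability model is the set $\mathcal{M}(\mathcal{X})$ of all probability distributions on the set
of types

$$\mathcal{M}(\mathcal{X}) = \left\{\, p\colon \mathcal{X} \to [0,1] \;\middle|\; \sum_{x \in \mathcal{X}} p(x) = 1 \,\right\}$$

A probability model $\mathcal{M}$ is called restricted in all other cases

$$\mathcal{M} \subseteq \mathcal{M}(\mathcal{X}) \quad \text{and} \quad \mathcal{M} \neq \mathcal{M}(\mathcal{X})$$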

In practice, most probability models are restricted since their instances are often defined on a
set $\mathcal{X}$ comprising multi-dimensional types such that certain parts of the types are statistically
independent (see the examples later in the paper). Here is another side note: We already checked that
the relative-frequency estimate is a probability distribution, meaning in terms of Definition 5
that the relative-frequency estimate is an instance of the unrestricted probability model. So,
from an extreme point of view, the relative-frequency estimation might also be regarded as
a fully-fledged estimation method exploiting a corpus and a probability model (namely, the
unrestricted model).

In the following, we define maximum-likelihood estimation as a method that aims at 
finding an instance of a given model which maximizes the probability of a given corpus. 
Later on, we will see that maximum-likelihood estimates have an additional property: They 
are the instances of the given probability model that have a "minimal distance" to the relative 
frequencies of the types in the corpus (see Theorem 2). So, indeed, maximum-likelihood
estimates can be intuitively thought of in the intended way: They are the instances of the 
probability model that might have originally generated the corpus. 

Definition 6 Let $f$ be a non-empty and finite corpus on a countable set $\mathcal{X}$ of types. Let $\mathcal{M}$
be a probability model on $\mathcal{X}$. The probability of the corpus allocated by an instance $p$ of
the model $\mathcal{M}$ is defined as

$$L(f;p) = \prod_{x \in \mathcal{X}} p(x)^{f(x)}$$

An instance $\hat{p}$ of the model $\mathcal{M}$ is called a maximum-likelihood estimate of $\mathcal{M}$ on $f$, if
and only if the corpus $f$ is allocated a maximum probability by $\hat{p}$

$$L(f;\hat{p}) = \max_{p \in \mathcal{M}} L(f;p)$$

(Based on continuity arguments, we use the convention that $p^0 = 1$ and $0^0 = 1$.)

Figure 2: Maximum-likelihood estimation and relative-frequency estimation yield for some
"exceptional" probability models the same estimate. These models are lightly restricted or even unrestricted
models that contain an instance comprising the relative frequencies of all corpus types (left-hand side).
In practice, however, most probability models will not behave like that. So, maximum-likelihood
estimation and relative-frequency estimation yield in most cases different estimates. As a further and
more serious consequence, the maximum-likelihood estimates have then to be searched for by genuine
optimization procedures (right-hand side).

By looking at this definition, we recognize that maximum-likelihood estimates are the
solutions of a quite complex optimization problem. So, some nasty questions about
maximum-likelihood estimation arise:

Existence: Is there for any probability model and any corpus a maximum-likelihood
estimate of the model on the corpus?

Uniqueness: Is there for any probability model and any corpus a unique
maximum-likelihood estimate of the model on the corpus?

Computability: For which probability models and corpora can maximum-likelihood
estimates be efficiently computed?

For some probability models $\mathcal{M}$, the following theorem gives a positive answer.
Theorem 1 Let $f$ be a non-empty and finite corpus on a countable set $\mathcal{X}$ of types. Then:

(i) The relative-frequency estimate $\tilde{p}$ is a unique maximum-likelihood estimate of the
unrestricted probability model $\mathcal{M}(\mathcal{X})$ on $f$.

(ii) The relative-frequency estimate $\tilde{p}$ is a maximum-likelihood estimate of a (restricted or
unrestricted) probability model $\mathcal{M}$ on $f$, if and only if $\tilde{p}$ is an instance of the model $\mathcal{M}$.
In this case, $\tilde{p}$ is a unique maximum-likelihood estimate of $\mathcal{M}$ on $f$.

Proof Ad (i): Combine Theorems 2 and 3. Ad (ii): "$\Rightarrow$" is trivial; "$\Leftarrow$" follows by (i). q.e.d.

At first glance, proposition (ii) seems to be more general than proposition (i), since
proposition (i) is about one single probability model, the unrestricted model, whereas
proposition (ii) gives some insight about the relation of the relative-frequency estimate to a
maximum-likelihood estimate of arbitrary restricted probability models (see also Figure 2). Both
propositions, however, are equivalent. As we will show later on, proposition (i) is equivalent to
the famous information inequality of information theory, for which various proofs have been
given in the literature.

Example 1 On the basis of the following corpus

$$f(a) = 2, \quad f(b) = 3, \quad f(c) = 5$$

we shall calculate the maximum-likelihood estimate of the unrestricted probability model
$\mathcal{M}(\{a,b,c\})$, as well as the maximum-likelihood estimate of the restricted probability model

$$\mathcal{M} = \{\, p \in \mathcal{M}(\{a,b,c\}) \mid p(a) = 0.5 \,\}$$

The solution is instructive, but is left to the reader.
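Readers who want to check their own solution numerically could use a sketch like the following. It evaluates the corpus probability of Definition 6 in the log domain and compares a candidate for the restricted model against a brute-force grid. The candidate (splitting the remaining mass 0.5 between b and c in proportion 3 : 5) is our own derivation, not taken from the text, and the grid resolution is an arbitrary choice.

```python
import math

def log_likelihood(f, p):
    """log L(f; p) = sum_x f(x) * log p(x), using the convention 0^0 = 1."""
    total = 0.0
    for x, fx in f.items():
        if fx == 0:
            continue                 # p(x)^0 = 1 contributes nothing
        if p.get(x, 0.0) == 0.0:
            return -math.inf         # a seen type with probability zero
        total += fx * math.log(p[x])
    return total

f = {"a": 2, "b": 3, "c": 5}
size = sum(f.values())

# Unrestricted model: Theorem 1 (i) gives the relative frequencies.
p_unrestricted = {x: fx / size for x, fx in f.items()}

# Restricted model p(a) = 0.5: candidate solution (an assumption to be checked).
p_restricted = {"a": 0.5, "b": 0.5 * 3 / 8, "c": 0.5 * 5 / 8}

# Brute-force check over restricted instances with p(b) = t / 2000.
grid = [{"a": 0.5, "b": t / 2000, "c": 0.5 - t / 2000} for t in range(1, 1000)]
best = max(grid, key=lambda p: log_likelihood(f, p))
print(p_unrestricted, p_restricted, best)
```

Working with log-probabilities avoids numerical underflow for larger corpora and does not change the maximizer, since the logarithm is monotone.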



The Information Inequality of Information Theory 

Definition 7 The relative entropy $D(p \,\|\, q)$ of the probability distribution $p$ with respect
to the probability distribution $q$ is defined by

$$D(p \,\|\, q) = \sum_{x \in \mathcal{X}} p(x) \log \frac{p(x)}{q(x)}$$

(Based on continuity arguments, we use the convention that $0 \log \frac{0}{q} = 0$ and $p \log \frac{p}{0} = \infty$ and
$0 \log \frac{0}{0} = 0$. The logarithm is calculated with respect to the base 2.)

Connecting maximum-likelihood estimation with the concept of relative entropy, the
following theorem gives the important insight that the relative entropy of the relative-frequency
estimate is minimal with respect to a maximum-likelihood estimate.

Theorem 2 Let $\tilde{p}$ be the relative-frequency estimate on a non-empty and finite corpus $f$, and
let $\mathcal{M}$ be a probability model on the set $\mathcal{X}$ of types. Then: An instance $\hat{p}$ of the model $\mathcal{M}$ is
a maximum-likelihood estimate of $\mathcal{M}$ on $f$, if and only if the relative entropy of $\tilde{p}$ is minimal
with respect to $\hat{p}$

$$D(\tilde{p} \,\|\, \hat{p}) = \min_{p \in \mathcal{M}} D(\tilde{p} \,\|\, p)$$

Proof First, the relative entropy $D(\tilde{p} \,\|\, p)$ is simply the difference of two further entropy
values, the so-called cross-entropy $H(\tilde{p};p) = -\sum_{x \in \mathcal{X}} \tilde{p}(x) \log p(x)$ and the entropy
$H(\tilde{p}) = -\sum_{x \in \mathcal{X}} \tilde{p}(x) \log \tilde{p}(x)$ of the relative-frequency estimate

$$D(\tilde{p} \,\|\, p) = H(\tilde{p};p) - H(\tilde{p})$$

(Based on continuity arguments and in full agreement with the convention used in Definition 7,
we use here that $p \log 0 = -\infty$ and $0 \log 0 = 0$.) It follows that minimizing the relative
entropy is equivalent to minimizing the cross-entropy (as a function of the instances $p$ of
the given probability model $\mathcal{M}$). The cross-entropy, however, is proportional to the negative
log-probability of the corpus $f$

$$H(\tilde{p};p) = -\frac{1}{|f|} \log L(f;p)$$

So, finally, minimizing the relative entropy $D(\tilde{p} \,\|\, p)$ is equivalent to maximizing the corpus
probability $L(f;p)$.^3
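The two identities used in the proof can be checked numerically. The sketch below is only an illustration: the function names and the model instance `p` are hypothetical, the corpus is the one from Example 1, and the code assumes that `p` assigns a positive probability to every type seen in the corpus.

```python
import math

def cross_entropy(p_tilde, p):
    """H(p~; p) = -sum_x p~(x) * log2 p(x), using 0 * log 0 = 0."""
    return -sum(pt * math.log2(p[x]) for x, pt in p_tilde.items() if pt > 0)

def entropy(p):
    """H(p) is the cross-entropy of p with itself."""
    return cross_entropy(p, p)

def relative_entropy(p_tilde, p):
    """D(p~ || p) = sum_x p~(x) * log2(p~(x) / p(x))."""
    return sum(pt * math.log2(pt / p[x]) for x, pt in p_tilde.items() if pt > 0)

def log2_likelihood(f, p):
    """log2 L(f; p) = sum_x f(x) * log2 p(x)."""
    return sum(fx * math.log2(p[x]) for x, fx in f.items() if fx > 0)

f = {"a": 2, "b": 3, "c": 5}
size = sum(f.values())
p_tilde = {x: fx / size for x, fx in f.items()}
p = {"a": 0.5, "b": 0.25, "c": 0.25}      # hypothetical model instance

lhs = relative_entropy(p_tilde, p)
rhs = cross_entropy(p_tilde, p) - entropy(p_tilde)
print(lhs, rhs, -log2_likelihood(f, p) / size)
```

The first two printed values should agree up to rounding, and the third should agree with the cross-entropy, since $H(\tilde{p};p) = -\frac{1}{|f|} \log L(f;p)$.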

Together with Theorem 2, the following theorem, the so-called information inequality of
information theory, proves Theorem 1. The information inequality states simply that the
relative entropy is a non-negative number - which is zero, if and only if the two probability
distributions are equal.

Theorem 3 (Information Inequality) Let $p$ and $q$ be two probability distributions. Then

$$D(p \,\|\, q) \ge 0$$

with equality if and only if $p(x) = q(x)$ for all $x \in \mathcal{X}$.

Proof See, e.g., Cover and Thomas (1991), page 26.



*Maximum-Entropy Estimation 

Readers only interested in the expectation-maximization algorithm are encouraged to omit
this section. For completeness, however, note that the relative entropy is asymmetric. That
means, in general

$$D(p \,\|\, q) \neq D(q \,\|\, p)$$

It is easy to check that the triangle inequality is not valid either. So, the relative entropy $D(\cdot \,\|\, \cdot)$
is not a "true" distance function. On the other hand, $D(\cdot \,\|\, \cdot)$ has some of the properties of a
distance function. In particular, it is always non-negative and it is zero if and only if $p = q$
(see Theorem 3). So far, however, we aimed at minimizing the relative entropy with respect
to its second argument, filling the first argument slot of $D(\cdot \,\|\, \cdot)$ with the relative-frequency
estimate $\tilde{p}$. Obviously, these observations raise the question whether it is also possible to
derive other "good" estimates by minimizing the relative entropy with respect to its first
argument. So, in terms of Theorem 2, it might be interesting to ask for model instances
$p^* \in \mathcal{M}$ with

$$D(p^* \,\|\, \tilde{p}) = \min_{p \in \mathcal{M}} D(p \,\|\, \tilde{p})$$

For at least two reasons, however, this initial approach of relative-entropy estimation is too 
simplistic. First, it is tailored to probability models that lack any generalization power. 
Second, it does not provide deeper insight when estimating constrained probability models. 
Here are the details: 

• A closer look at Definition 7 reveals that the relative entropy $D(p \,\|\, \tilde{p})$ is finite for those
model instances $p \in \mathcal{M}$ only that fulfill

$$\tilde{p}(x) = 0 \;\Rightarrow\; p(x) = 0$$

So, the initial approach would lead to model instances that are completely unable to
generalize, since they are not allowed to allocate positive probabilities to at least some
of the types not seen in the training corpus.

• Theorem 2 guarantees that the relative-frequency estimate $\tilde{p}$ is a solution to the initial
approach of relative-entropy estimation, whenever $\tilde{p} \in \mathcal{M}$. Now, Definition 8 introduces
the constrained probability models $\mathcal{M}_{\text{constr}}$, and indeed, it is easy to check that $\tilde{p}$ is
always an instance of these models. In other words, estimating constrained probability
models by the approach above does not result in interesting model instances.

^3 For completeness, note that the perplexity of a corpus $f$ allocated by a model instance $p$ is defined as
$\mathrm{perp}(f;p) = 2^{H(\tilde{p};p)}$. This yields $\mathrm{perp}(f;p) = \sqrt[|f|]{\frac{1}{L(f;p)}}$ and $L(f;p) = \left(\frac{1}{\mathrm{perp}(f;p)}\right)^{|f|}$ as well as the common
interpretation that the perplexity value measures the complexity of the given corpus from the
model instance's view: the perplexity is equal to the size of an imaginary word list from which the corpus
can be generated by the model instance - assuming that all words on this list are equally probable. Moreover,
the equations state that minimizing the corpus perplexity $\mathrm{perp}(f;p)$ is equivalent to maximizing the corpus
probability $L(f;p)$.

Clearly, all the mentioned drawbacks are due to the fact that the relative-entropy minimization
is performed with respect to the relative-frequency estimate. As a remedy, we simply switch
to a more convenient reference distribution, thereby formally generalizing the initial problem
setting. So, as the final request, we ask for model instances $p^* \in \mathcal{M}$ with

$$D(p^* \,\|\, p_0) = \min_{p \in \mathcal{M}} D(p \,\|\, p_0)$$

In this setting, the reference distribution $p_0 \in \mathcal{M}(\mathcal{X})$ is a given instance of the unrestricted
probability model, and from what we have seen so far, $p_0$ should allocate all types of interest
a positive probability, and moreover, $p_0$ should not be itself an instance of the probability
model $\mathcal{M}$. Indeed, this request will lead us to the interesting maximum-entropy estimates.
Note first that

$$D(p \,\|\, p_0) = H(p;p_0) - H(p)$$

So, minimizing $D(p \,\|\, p_0)$ as a function of the model instances $p$ is equivalent to minimizing
the cross-entropy $H(p;p_0)$ and simultaneously maximizing the model entropy $H(p)$. Now,
simultaneous optimization is a hard task in general, and this gives reason to focus first
on maximizing the entropy $H(p)$ in isolation. The following definition presents
maximum-entropy estimation in terms of the well-known maximum-entropy principle (Jaynes 1957).
Sloppily formulated, the maximum-entropy principle recommends maximizing the entropy
$H(p)$ as a function of the instances $p$ of certain "constrained" probability models.

Definition 8 Let $f_1, \dots, f_d$ be a finite number of real-valued functions on a set $\mathcal{X}$ of types,
the so-called feature functions^4. Let $\tilde{p}$ be the relative-frequency estimate on a non-empty
and finite corpus $f$ on $\mathcal{X}$. Then, the probability model constrained by the expected
values of $f_1, \dots, f_d$ on $f$ is defined as

$$\mathcal{M}_{\text{constr}} = \left\{\, p \in \mathcal{M}(\mathcal{X}) \;\middle|\; E_p f_i = E_{\tilde{p}} f_i \text{ for } i = 1, \dots, d \,\right\}$$

Here, each $E_p f_i$ is the model instance's expectation of $f_i$

$$E_p f_i = \sum_{x \in \mathcal{X}} p(x) f_i(x)$$

constrained to match $E_{\tilde{p}} f_i$, the observed expectation of $f_i$

$$E_{\tilde{p}} f_i = \sum_{x \in \mathcal{X}} \tilde{p}(x) f_i(x)$$

Furthermore, a model instance $p^* \in \mathcal{M}_{\text{constr}}$ is called a maximum-entropy estimate of
$\mathcal{M}_{\text{constr}}$ if and only if

$$H(p^*) = \max_{p \in \mathcal{M}_{\text{constr}}} H(p)$$

^4 Each of these feature functions can be thought of as being constructed by inspecting the set of types,
thereby measuring a specific property of the types $x \in \mathcal{X}$. For example, if working in a formal-grammar
framework, then it might be worthwhile to look (at least) at some feature functions $f_r$ directly associated with the
rules $r$ of the given formal grammar. The "measure" $f_r(x)$ of a specific rule $r$ for the analyses $x \in \mathcal{X}$ of the
grammar might be calculated, for example, in terms of the occurrence frequency of $r$ in the sequence of those
rules which are necessary to produce $x$. For instance, Chi (1999) studied this approach for the context-free
grammar formalism. Note, however, that there is in general no recipe for constructing "good" feature functions:
Often, it is really an intellectual challenge to find those feature functions that describe the given data as well
as possible (or at least in a satisfying manner).

It is well-known that the maximum-entropy estimates have some nice properties. For example,
as Definition 9 and Theorem 4 show, they can be identified as the unique
maximum-likelihood estimates of the so-called exponential models (which are also known as log-linear
models).

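Definition 9 Let $f_1, \dots, f_d$ be a finite number of feature functions on a set $\mathcal{X}$ of types. The
exponential model of $f_1, \dots, f_d$ is defined by

$$\mathcal{M}_{\text{exp}} = \left\{\, p \in \mathcal{M}(\mathcal{X}) \;\middle|\; p(x) = \frac{1}{Z_\lambda} e^{\lambda_1 f_1(x) + \dots + \lambda_d f_d(x)} \text{ with } \lambda_1, \dots, \lambda_d, Z_\lambda \in \mathbb{R} \,\right\}$$

Here, the normalizing constant $Z_\lambda$ (with $\lambda$ as a short form for the sequence $\lambda_1, \dots, \lambda_d$)
guarantees that $p \in \mathcal{M}(\mathcal{X})$, and it is given by

$$Z_\lambda = \sum_{x \in \mathcal{X}} e^{\lambda_1 f_1(x) + \dots + \lambda_d f_d(x)}$$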
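To make Definition 9 concrete, the following sketch computes one instance of an exponential model together with its normalizing constant, and the expectation of a feature function as used in the constrained models. It assumes a finite set of types so that $Z_\lambda$ is a finite sum; the types, the two feature functions and the function names are hypothetical.

```python
import math

def exponential_model_instance(types, features, lambdas):
    """p(x) = exp(sum_i lambda_i * f_i(x)) / Z_lambda on a finite set of types."""
    scores = {x: math.exp(sum(l * f(x) for l, f in zip(lambdas, features)))
              for x in types}
    z = sum(scores.values())              # normalizing constant Z_lambda
    return {x: s / z for x, s in scores.items()}, z

def expectation(p, feature):
    """E_p f_i = sum_x p(x) * f_i(x)."""
    return sum(px * feature(x) for x, px in p.items())

types = ["aa", "ab", "bb"]                                    # hypothetical types
features = [lambda x: x.count("a"), lambda x: x.count("b")]   # hypothetical features

# With all parameters set to zero, the instance is the uniform distribution.
p, z = exponential_model_instance(types, features, [0.0, 0.0])
print(p, z, expectation(p, features[0]))
```

Searching for parameters whose feature expectations match the observed ones is a separate optimization problem that this sketch does not address.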

Theorem 4 Let $f$ be a non-empty and finite corpus, and $f_1, \dots, f_d$ be a finite number of
feature functions on a set $\mathcal{X}$ of types. Then

(i) The maximum-entropy estimates of $\mathcal{M}_{\text{constr}}$ are instances of $\mathcal{M}_{\text{exp}}$, and the
maximum-likelihood estimates of $\mathcal{M}_{\text{exp}}$ on $f$ are instances of $\mathcal{M}_{\text{constr}}$.

(ii) If $p^* \in \mathcal{M}_{\text{constr}} \cap \mathcal{M}_{\text{exp}}$, then $p^*$ is both a unique maximum-entropy estimate of $\mathcal{M}_{\text{constr}}$
and a unique maximum-likelihood estimate of $\mathcal{M}_{\text{exp}}$ on $f$.

Part (i) of the theorem simply suggests the form of the maximum-entropy or
maximum-likelihood estimates we are looking for. By combining both findings of (i), however, the
search space is drastically reduced for both estimation methods: We simply have to look at
the intersection of the involved probability models. In turn, exactly this fact makes the second
part of the theorem so valuable. If there is a maximum-entropy or a maximum-likelihood
estimate, then it is in the intersection of both models, and thus according to Part (ii), it is a
unique estimate, and even more, it is both a maximum-entropy and a maximum-likelihood
estimate.

Figure 3: Maximum-likelihood estimation generalizes maximum-entropy estimation, as well as both
variants of minimum relative-entropy estimation (where either the first or the second argument slot of
$D(\cdot \,\|\, \cdot)$ is filled by a given probability distribution).



Proof See e.g. Cover and Thomas (1991), pages 266-278. For an interesting alternate proof
of (ii), see Ratnaparkhi (1997). Note, however, that the proof of Ratnaparkhi's Theorem 1
is incorrect whenever the set $\mathcal{X}$ of types is infinite. Although Ratnaparkhi's proof is very
elegant, it relies on the existence of a uniform distribution on $\mathcal{X}$ that simply does not exist
in this special case. By contrast, Cover and Thomas prove Theorem 11.1.1 without using a
uniform distribution on $\mathcal{X}$, and so, they achieve indeed the more general result.

Finally, we are coming back to our request of minimizing the relative entropy with respect to
a given reference distribution $p_0 \in \mathcal{M}(\mathcal{X})$. For constrained probability models, the relevant
results do not differ much from the results described in Theorem 4. So, let

$$\mathcal{M}_{\text{exp-ref}} = \left\{\, p \in \mathcal{M}(\mathcal{X}) \;\middle|\; p(x) = \frac{1}{Z_\lambda} e^{\lambda_1 f_1(x) + \dots + \lambda_d f_d(x)} \cdot p_0(x) \text{ with } \lambda_1, \dots, \lambda_d, Z_\lambda \in \mathbb{R} \,\right\}$$

Then, along the lines of the proof of Theorem 4, it can also be proven that the following
propositions are valid.

(i) The minimum relative-entropy estimates of $\mathcal{M}_{\text{constr}}$ are instances of $\mathcal{M}_{\text{exp-ref}}$, and the
maximum-likelihood estimates of $\mathcal{M}_{\text{exp-ref}}$ on $f$ are instances of $\mathcal{M}_{\text{constr}}$.

(ii) If $p^* \in \mathcal{M}_{\text{constr}} \cap \mathcal{M}_{\text{exp-ref}}$, then $p^*$ is both a unique minimum relative-entropy estimate
of $\mathcal{M}_{\text{constr}}$ and a unique maximum-likelihood estimate of $\mathcal{M}_{\text{exp-ref}}$ on $f$.

All results are displayed in Figure 3.

3 The Expectation-Maximization Algorithm 



The expectation-maximization algorithm was introduced by Dempster et al. (1977), who also
presented its main properties. In short, the EM algorithm aims at finding maximum-likelihood
estimates for settings where this appears to be difficult if not impossible. The trick of the
EM algorithm is to map the given data to complete data on which it is well-known how
to perform maximum-likelihood estimation. Typically, the EM algorithm is applied in the
following setting:

• Direct maximum-likelihood estimation of the given probability model on the given
corpus is not feasible, for example, because the likelihood function is too complex (e.g. it is a
product of sums).

• There is an obvious (but one-to-many) mapping to complete data, on which
maximum-likelihood estimation can be easily done. The prototypical example is indeed that
maximum-likelihood estimation on the complete data is already a solved problem.

Figure 4: Input and output of the EM algorithm.
Both relative-frequency and maximum-likelihood estimation are common estimation methods 
with a two-fold input, a corpus and a probability model^5, such that the instances of the
model might have generated the corpus. The output of both estimation methods is simply 
an instance of the probability model, ideally, the unknown distribution that generated the 
corpus. In contrast to this setting, in which we are almost completely informed (the only 
thing that is not known to us is the unknown distribution that generated the corpus), the 
expectation-maximization algorithm is designed to estimate an instance of the probability 
model for settings, in which we are incompletely informed. 

To be more specific, instead of a complete-data corpus, the input of the
expectation-maximization algorithm is an incomplete-data corpus together with a so-called symbolic
analyzer. A symbolic analyzer is a device assigning to each incomplete-data type a set
of analyses, each analysis being a complete-data type. As a result, the missing
complete-data corpus can be partly compensated by the expectation-maximization algorithm: The
application of the symbolic analyzer to the incomplete-data corpus leads to an ambiguous
complete-data corpus. The ambiguity arises as a consequence of the inherent analytical
ambiguity of the symbolic analyzer: the analyzer can replace each token of the
incomplete-data corpus by a set of complete-data types - the set of its analyses - but clearly, the symbolic
analyzer is not able to resolve the analytical ambiguity.

The expectation-maximization algorithm performs a sequence of runs over the resulting
ambiguous complete-data corpus. Each of these runs consists of an expectation step
followed by a maximization step. In the E step, the expectation-maximization algorithm
combines the symbolic analyzer with an instance of the probability model. The result of
this combination is a statistical analyzer which is able to resolve the analytical
ambiguity introduced by the symbolic analyzer. In the M step, the expectation-maximization
algorithm calculates an ordinary maximum-likelihood estimate on the resolved complete-data
corpus.

^5 We associate the relative-frequency estimate with the unrestricted probability model.

In general, however, a sequence of such runs is necessary. The reason is that we never
know which instance of the given probability model leads to a good statistical analyzer, and
thus, which instance of the probability model shall be used in the E-step. The
expectation-maximization algorithm provides a simple but somewhat surprising solution to this serious
problem. At the beginning, a randomly generated starting instance of the given probability
model is used for the first E-step. In further iterations, the estimate of the M-step is used
for the next E-step. Figure 4 displays the input and the output of the EM algorithm. The
procedure of the EM algorithm is displayed in Figure 5.

Symbolic and Statistical Analyzers 

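Definition 10 Let $\mathcal{X}$ and $\mathcal{Y}$ be non-empty and countable sets. A function

$$\mathcal{A}\colon \mathcal{Y} \to 2^{\mathcal{X}}$$

is called a symbolic analyzer if the (possibly empty) sets of analyses $\mathcal{A}(y) \subseteq \mathcal{X}$ are
pair-wise disjoint, and the union of all sets of analyses $\mathcal{A}(y)$ is complete

$$\mathcal{X} = \bigcup_{y \in \mathcal{Y}} \mathcal{A}(y)$$

In this case, $\mathcal{Y}$ is called the set of incomplete-data types, whereas $\mathcal{X}$ is called the set of
complete-data types. So, in other words, the analyses $\mathcal{A}(y)$ of the incomplete-data types $y$
form a partition of the complete-data types $\mathcal{X}$. Therefore, for each $x \in \mathcal{X}$ there exists a unique $y \in \mathcal{Y}$,
the so-called yield of $x$, such that $x$ is an analysis of $y$

$$y = \mathrm{yield}(x) \quad \text{if and only if} \quad x \in \mathcal{A}(y)$$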

For example, if working in a formal-grammar framework, the grammatical sentences can be
interpreted as the incomplete-data types, whereas the grammatical analyses of the sentences
are the complete-data types. So, in terms of Definition 10, a so-called parser - a device
assigning a set of grammatical analyses to a given sentence - is clearly a symbolic analyzer:
The most important thing to check is that the parser does not assign a given grammatical
analysis to two different sentences - which is pretty obvious, if the sentence words are part
of the grammatical analyses.

Definition 11 A pair $\langle \mathcal{A}, p \rangle$ consisting of a symbolic analyzer $\mathcal{A}$ and a probability
distribution $p$ on the complete-data types $\mathcal{X}$ is called a statistical analyzer. We use a statistical
analyzer to induce probabilities for the incomplete-data types $y \in \mathcal{Y}$

$$p(y) := \sum_{x \in \mathcal{A}(y)} p(x)$$

Figure 5: Procedure of the EM algorithm. An incomplete-data corpus, a symbolic analyzer (a device
assigning to each incomplete-data type a set of complete-data types), and a complete-data model are
given. In the E step, the EM algorithm combines the symbolic analyzer with an instance $q$ of the
probability model. The result of this combination is a statistical analyzer that is able to resolve the
ambiguity of the given incomplete data. In fact, the statistical analyzer is used to generate an expected
complete-data corpus $f_q$. In the M step, the EM algorithm calculates an ordinary maximum-likelihood
estimate of the complete-data model on the complete-data corpus generated in the E step. In further
iterations, the estimates of the M-steps are used in the subsequent E-steps. The output of the EM
algorithm are the estimates that are produced in the M steps.

Even more important, we use a statistical analyzer to resolve the analytical ambiguity of
an incomplete-data type $y \in \mathcal{Y}$ by looking at the conditional probabilities of the analyses
$x \in \mathcal{A}(y)$

$$p(x|y) := \frac{p(x)}{p(y)} \quad \text{where } y = \mathrm{yield}(x)$$

It is easy to check that the statistical analyzer induces a proper probability distribution on
the set $\mathcal{Y}$ of incomplete-data types

$$\sum_{y \in \mathcal{Y}} p(y) = \sum_{y \in \mathcal{Y}} \sum_{x \in \mathcal{A}(y)} p(x) = \sum_{x \in \mathcal{X}} p(x) = 1$$

Moreover, the statistical analyzer also induces proper conditional probability distributions on
the sets of analyses $\mathcal{A}(y)$

$$\sum_{x \in \mathcal{A}(y)} p(x|y) = \sum_{x \in \mathcal{A}(y)} \frac{p(x)}{p(y)} = \frac{p(y)}{p(y)} = 1$$

Of course, by defining $p(x|y) = 0$ for $y \neq \mathrm{yield}(x)$, $p(\cdot|y)$ is even a probability distribution on
the full set $\mathcal{X}$ of analyses.
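The induced probabilities of a statistical analyzer can be sketched as follows. The symbolic analyzer is represented here as a dictionary from incomplete-data types to their sets of analyses; this representation, the type names and the function names are assumptions made only for the example, and the code assumes $p(y) > 0$ wherever a conditional probability is requested.

```python
def incomplete_data_probability(analyzer, p, y):
    """p(y) = sum of p(x) over the analyses x in A(y)."""
    return sum(p[x] for x in analyzer[y])

def conditional_probability(analyzer, p, x, y):
    """p(x|y) = p(x) / p(y) if x is an analysis of y, and 0 otherwise."""
    if x not in analyzer[y]:
        return 0.0
    return p[x] / incomplete_data_probability(analyzer, p, y)

analyzer = {"y1": {"x11", "x12"}, "y2": {"x21"}}   # hypothetical partition of X
p = {"x11": 0.2, "x12": 0.3, "x21": 0.5}           # hypothetical complete-data distribution

for y in sorted(analyzer):
    print(y, incomplete_data_probability(analyzer, p, y))
print(conditional_probability(analyzer, p, "x11", "y1"))
```

Because the sets of analyses partition the complete-data types, the induced values $p(y)$ sum to one, just as the complete-data probabilities do.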

Input, Procedure, and Output of the EM Algorithm 

Definition 12 The input of the expectation-maximization (EM) algorithm is

(i) a symbolic analyzer, i.e., a function $\mathcal{A}$ which assigns a set of analyses $\mathcal{A}(y) \subseteq \mathcal{X}$
to each incomplete-data type $y \in \mathcal{Y}$, such that all sets of analyses form a partition
of the set $\mathcal{X}$ of complete-data types

$$\mathcal{X} = \bigcup_{y \in \mathcal{Y}} \mathcal{A}(y)$$

(ii) a non-empty and finite incomplete-data corpus, i.e., a frequency distribution $f$ on
the set of incomplete-data types

$$f\colon \mathcal{Y} \to \mathbb{R} \quad \text{such that} \quad f(y) \ge 0 \text{ for all } y \in \mathcal{Y} \quad \text{and} \quad 0 < |f| < \infty$$

(iii) a complete-data model $\mathcal{M} \subseteq \mathcal{M}(\mathcal{X})$, i.e., each instance $p \in \mathcal{M}$ is a probability
distribution on the set of complete-data types

$$p\colon \mathcal{X} \to [0,1] \quad \text{and} \quad \sum_{x \in \mathcal{X}} p(x) = 1$$

(*) implicit input: an incomplete-data model $\mathcal{M} \subseteq \mathcal{M}(\mathcal{Y})$ induced by the symbolic
analyzer and the complete-data model. To see this, recall Definition 11. Together with a
given instance of the complete-data model, the symbolic analyzer constitutes a statistical
analyzer which, in turn, induces the following instance of the incomplete-data model

$$p\colon \mathcal{Y} \to [0,1] \quad \text{and} \quad p(y) = \sum_{x \in \mathcal{A}(y)} p(x)$$

(Note: For both complete and incomplete data, the same notation symbols $\mathcal{M}$ and $p$ are
used. The sloppy notation, however, is justified, because the incomplete-data model is a
marginal of the complete-data model.)

(iv) a (randomly generated) starting instance $p_0$ of the complete-data model $\mathcal{M}$.

(Note: If permitted by $\mathcal{M}$, then $p_0$ should not assign to any $x \in \mathcal{X}$ a probability of zero.)

Definition 13 The procedure of the EM algorithm is 

(1) for each $i = 1, 2, 3, \ldots$ do

(2) $q := p_{i-1}$

(3) E-step: compute the complete-data corpus $f_q: \mathcal{X} \to \mathbb{R}$ expected by $q$

$$f_q(x) := f(y) \cdot q(x|y) \quad \text{where } y = \mathrm{yield}(x)$$

(4) M-step: compute a maximum-likelihood estimate $\hat{p}$ of $\mathcal{M}$ on $f_q$

$$L(f_q; \hat{p}) = \max_{p \in \mathcal{M}} L(f_q; p)$$

(Implicit pre-condition of the EM algorithm: it exists!)

(5) $p_i := \hat{p}$

(6) end // for each i

(7) print $p_0, p_1, p_2, p_3, \ldots$

[Figure 6 (diagram): the frequency $f(y)$ of each type $y_1, y_2, y_3, \ldots$ of the incomplete-data corpus $f$ is distributed to the analyzes $x_{i1}, x_{i2}, x_{i3}, \ldots$ of $y_i$ according to $q(x|y)$; the analyzes of $y_i$ receive the total frequency $f(y_i)$, and together they form the complete-data corpus $f_q$.]


Figure 6: The E-step of the EM algorithm. A complete-data corpus $f_q(x)$ is generated on the basis
of the incomplete-data corpus $f(y)$ and the conditional probabilities $q(x|y)$ of the analyzes of $y$. The
frequency $f(y)$ is distributed among the complete-data types $x \in \mathcal{A}(y)$ according to the conditional
probabilities $q(x|y)$. A simple reversed procedure guarantees that the original incomplete-data corpus
$f(y)$ can be recovered from the generated corpus $f_q(x)$: sum up all frequencies $f_q(x)$ with $x \in \mathcal{A}(y)$.
So the size of both corpora is the same, $|f_q| = |f|$. Memory hook: $f_q$ is the qomplete-data corpus.



In line (3) of the EM procedure, a complete-data corpus $f_q(x)$ has to be generated on the basis
of the incomplete-data corpus $f(y)$ and the conditional probabilities $q(x|y)$ of the analyzes of $y$
(conditional probabilities are introduced in Definition 11). In fact, this generation procedure
is conceptually very easy: according to the conditional probabilities $q(x|y)$, the frequency
$f(y)$ has to be distributed among the complete-data types $x \in \mathcal{A}(y)$. Figure 6 displays the
procedure. Moreover, there exists a simple reversed procedure (summation of all frequencies
$f_q(x)$ with $x \in \mathcal{A}(y)$) which guarantees that the original corpus $f(y)$ can be recovered from
the generated corpus $f_q(x)$. Finally, the size of both corpora is the same

$$|f_q| = |f|$$
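The E-step described above can be sketched in a few lines. This is only an illustration under stated assumptions: the function name `e_step`, the dictionary layout, and the toy types `y1`, `x11`, and so on are hypothetical, and $q(y)$ is assumed to be positive for every type that occurs in the corpus.

```python
def e_step(f, q, analyzes):
    """Expected complete-data corpus: f_q(x) = f(y) * q(x|y) for x in A(y)."""
    f_q = {}
    for y, freq in f.items():
        q_y = sum(q[x] for x in analyzes[y])  # q(y); assumed to be positive
        for x in analyzes[y]:
            f_q[x] = freq * q[x] / q_y
    return f_q


# Hypothetical toy input: three incomplete-data types with their analyzes.
analyzes = {"y1": ["x11", "x12"], "y2": ["x21"], "y3": ["x31", "x32", "x33"]}
f = {"y1": 10.0, "y2": 4.0, "y3": 6.0}
q = {"x11": 0.1, "x12": 0.3, "x21": 0.2, "x31": 0.1, "x32": 0.1, "x33": 0.2}

f_q = e_step(f, q, analyzes)

# Reversed procedure: summing f_q over A(y) recovers f(y).
recovered = {y: sum(f_q[x] for x in xs) for y, xs in analyzes.items()}
print(f_q)
print(recovered)
```

The last two lines illustrate the reversed procedure: the recovered frequencies coincide with the incomplete-data corpus, so both corpora have the same size.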

In line (4) of the EM procedure, it is stated that a maximum-likelihood estimate $\hat{p}$ of the
complete-data model has to be computed on the complete-data corpus $f_q$ expected by $q$.
Recall for this purpose that the probability of $f_q$ allocated by an instance $p \in \mathcal{M}$ is defined
as

$$L(f_q; p) = \prod_{x \in \mathcal{X}} p(x)^{f_q(x)}$$

In contrast, the probability of the incomplete-data corpus $f$ allocated by an instance $p$ of the
incomplete-data model is much more complex. Using Definition 12 (*), we get an expression
involving a product of sums

$$L(f; p) = \prod_{y \in \mathcal{Y}} \Big( \sum_{x \in \mathcal{A}(y)} p(x) \Big)^{f(y)}$$

Nevertheless, the following theorem reveals that the EM algorithm aims at finding an instance
of the incomplete-data model which possibly maximizes the probability of the incomplete-data
corpus.
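The difference between the two corpus probabilities is easiest to see in log space, where the complete-data case is a plain sum and the incomplete-data case is a sum of logarithms of sums. A minimal sketch, with hypothetical helper names and toy numbers that are not taken from the text:

```python
import math


def log_l_complete(f_q, p):
    """log L(f_q; p) = sum over x of f_q(x) * log p(x)."""
    return sum(c * math.log(p[x]) for x, c in f_q.items() if c > 0)


def log_l_incomplete(f, p, analyzes):
    """log L(f; p) = sum over y of f(y) * log( sum of p(x) over x in A(y) )."""
    return sum(
        freq * math.log(sum(p[x] for x in analyzes[y]))
        for y, freq in f.items()
        if freq > 0
    )


# Hypothetical toy data.
analyzes = {"y1": ["x11", "x12"], "y2": ["x21"]}
p = {"x11": 0.25, "x12": 0.25, "x21": 0.5}
f = {"y1": 2.0, "y2": 1.0}
f_q = {"x11": 1.0, "x12": 1.0, "x21": 1.0}

print(log_l_complete(f_q, p))
print(log_l_incomplete(f, p, analyzes))
```

Working with logarithms is a design choice of this sketch; it avoids numerical underflow for corpora of realistic size.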

Theorem 5 The output of the EM algorithm is: A sequence of instances of the complete-data
model $\mathcal{M}$, the so-called EM re-estimates,

$$p_0, p_1, p_2, p_3, \ldots$$

such that the sequence of probabilities allocated to the incomplete-data corpus is monotonically
increasing

$$L(f; p_0) \leq L(f; p_1) \leq L(f; p_2) \leq L(f; p_3) \leq \ldots$$

It is common wisdom that the sequence of EM re-estimates will converge to a (local) maximum-likelihood
estimate of the incomplete-data model on the incomplete-data corpus. As proven by
Wu (1983), however, the EM algorithm will do this only in specific circumstances. Of course,
it is guaranteed that the sequence of corpus probabilities (allocated by the EM re-estimates)
must converge. However, we are more interested in the behavior of the EM re-estimates themselves.
Now, intuitively, the EM algorithm might get stuck in a saddle point or even a local minimum
of the corpus-probability function, while the associated model instances hop around
in an uncontrolled way (for example, on a circle-like path in the "space" of all model instances).

Proof See Theorems 6 and 7.

The Generalized Expectation-Maximization Algorithm 

The EM algorithm performs a sequence of maximum-likelihood estimations on complete data,
resulting in good re-estimates on incomplete data ("good" in the sense of Theorem 5). The
following theorem, however, reveals that the EM algorithm might overdo it somehow, since
there exist alternative M-steps which can be performed more easily, and which result in re-estimates
having the same property as the EM re-estimates.

Definition 14 A generalized expectation-maximization (GEM) algorithm has exactly the same
input as the EM algorithm, but an easier M-step is performed in its procedure:

(4) M-step (GEM): compute an instance $\hat{p}$ of the complete-data model $\mathcal{M}$ such that

$$L(f_q; \hat{p}) \geq L(f_q; q)$$

Theorem 6 The output of a GEM algorithm is: A sequence of instances of the complete-data
model $\mathcal{M}$, the so-called GEM re-estimates, such that the sequence of probabilities allocated
to the incomplete-data corpus is monotonically increasing.

Proof Various proofs have been given in the literature. The first one was presented by
Dempster et al. (1977). For other variants of the EM algorithm, the book of McLachlan and Krishnan (1997)
is a good source. Here, we present something along the lines of the original proof. Clearly,
a proof of the theorem requires that we are able to express the probability of the
given incomplete-data corpus $f$ in terms of the probabilities of the complete-data corpora $f_q$
which are involved in the M-steps of the GEM algorithm (where both types of corpora are
allocated a probability by the same instance $p$ of the model $\mathcal{M}$). A certain entity, which we
would like to call the expected cross-entropy on the analyzes, plays a major role in
solving this task. To be specific, the expected cross-entropy on the analyzes is defined as the
expectation of certain cross-entropy values $H_{\mathcal{A}(y)}(q;p)$ which are calculated on the different
sets $\mathcal{A}(y)$ of analyzes. Then, of course, the "expectation" is calculated on the basis of the
relative-frequency estimate $\tilde{p}$ of the given incomplete-data corpus

$$H_{\mathcal{A}}(q;p) = \sum_{y \in \mathcal{Y}} \tilde{p}(y) \cdot H_{\mathcal{A}(y)}(q;p)$$

Now, for two instances $q$ and $p$ of the complete-data model, their conditional probabilities
$q(x|y)$ and $p(x|y)$ form proper probability distributions on the set $\mathcal{A}(y)$ of analyzes of $y$ (see
Definition 11). So, the cross-entropy $H_{\mathcal{A}(y)}(q;p)$ on the set $\mathcal{A}(y)$ is simply given by

$$H_{\mathcal{A}(y)}(q;p) = - \sum_{x \in \mathcal{A}(y)} q(x|y) \log_2 p(x|y)$$

Recalling the central task of this proof, a bunch of relatively straightforward calculations
leads to the following interesting equation

$$L(f;p) = \left( 2^{H_{\mathcal{A}}(q;p)} \right)^{|f|} \cdot L(f_q;p)$$

Using this equation, we can state that

$$\frac{L(f;p)}{L(f;q)} = \left( 2^{H_{\mathcal{A}}(q;p) - H_{\mathcal{A}}(q;q)} \right)^{|f|} \cdot \frac{L(f_q;p)}{L(f_q;q)}$$

In what follows, we will show that, after each M-step of a GEM algorithm (i.e. for $p$ being a
GEM re-estimate $\hat{p}$), both of the factors on the right-hand side of this equation are not less
than one. First, an iterated application of the information inequality of information theory
yields

$$H_{\mathcal{A}}(q;p) - H_{\mathcal{A}}(q;q) = \sum_{y \in \mathcal{Y}} \tilde{p}(y) \cdot \left( H_{\mathcal{A}(y)}(q;p) - H_{\mathcal{A}(y)}(q;q) \right) = \sum_{y \in \mathcal{Y}} \tilde{p}(y) \cdot D_{\mathcal{A}(y)}(q \,\|\, p) \geq 0$$

Footnote (on the equation relating $L(f;p)$ and $L(f_q;p)$): It is easier to show that

$$H(\tilde{p};p) = H(\tilde{p}_q;p) - H_{\mathcal{A}}(q;p)$$

Here, $\tilde{p}$ is the relative-frequency estimate on the incomplete-data corpus $f$, whereas $\tilde{p}_q$ is the relative-frequency
estimate on the complete-data corpus $f_q$. However, by defining an "average perplexity of the analyzes",
$\mathrm{perp}_{\mathcal{A}}(q;p) := 2^{H_{\mathcal{A}}(q;p)}$ (see also Footnote 2), the true spirit of the equation can be revealed:

$$L(f_q;p) = L(f;p) \cdot \left( \frac{1}{\mathrm{perp}_{\mathcal{A}}(q;p)} \right)^{|f|}$$

This equation states that the probability of a complete-data corpus (generated by a statistical analyzer) is the
product of the probability of the given incomplete-data corpus and $|f|$-times the average probability of the
different corpora of analyzes (as generated for each of the $|f|$ tokens of the incomplete-data corpus).

So, the first factor is never (i.e. for no model instance $p$) less than one

$$\left( 2^{H_{\mathcal{A}}(q;p) - H_{\mathcal{A}}(q;q)} \right)^{|f|} \geq 1$$

Second, by definition of the M-step of a GEM algorithm, the second factor is also not less
than one

$$\frac{L(f_q;\hat{p})}{L(f_q;q)} \geq 1$$

So, it follows

$$\frac{L(f;\hat{p})}{L(f;q)} \geq 1$$

yielding that the probability of the incomplete-data corpus allocated by the GEM re-estimate
$\hat{p}$ is not less than the probability of the incomplete-data corpus allocated by the model instance
$q$ (which is either the starting instance $p_0$ of the GEM algorithm or the previously calculated
GEM re-estimate)

$$L(f;\hat{p}) \geq L(f;q)$$
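The central identity of the proof, $L(f;p) = (2^{H_{\mathcal{A}}(q;p)})^{|f|} \cdot L(f_q;p)$, can be sanity-checked numerically in its logarithmic form. The sketch below assumes a hypothetical toy setting (two incomplete-data types with two analyzes each); all names and numbers are illustrative and base-2 logarithms are used throughout.

```python
import math

# Hypothetical toy setting (all names and numbers are illustrative).
analyzes = {"y1": ["a", "b"], "y2": ["c", "d"]}
f = {"y1": 3.0, "y2": 5.0}
q = {"a": 0.1, "b": 0.4, "c": 0.3, "d": 0.2}
p = {"a": 0.3, "b": 0.2, "c": 0.1, "d": 0.4}
size = sum(f.values())


def marginal(m, y):
    return sum(m[x] for x in analyzes[y])


def expected_cross_entropy(q, p):
    """H_A(q; p): relative-frequency-weighted cross-entropy on the sets A(y)."""
    total = 0.0
    for y, freq in f.items():
        h_y = -sum(
            (q[x] / marginal(q, y)) * math.log2(p[x] / marginal(p, y))
            for x in analyzes[y]
        )
        total += (freq / size) * h_y
    return total


def log2_l_incomplete(p):
    return sum(freq * math.log2(marginal(p, y)) for y, freq in f.items())


def log2_l_complete(q, p):
    # log2 L(f_q; p), with f_q(x) = f(y) * q(x|y)
    return sum(
        freq * (q[x] / marginal(q, y)) * math.log2(p[x])
        for y, freq in f.items()
        for x in analyzes[y]
    )


lhs = log2_l_incomplete(p)
rhs = size * expected_cross_entropy(q, p) + log2_l_complete(q, p)
gap = expected_cross_entropy(q, p) - expected_cross_entropy(q, q)
print(lhs, rhs, gap)
```

The two sides agree up to floating-point error, and the gap between the expected cross-entropies is non-negative, as the information inequality requires.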
Theorem 7 An EM algorithm is a GEM algorithm. 

Proof In the M-step of an EM algorithm, a model instance $\hat{p}$ is selected such that

$$L(f_q;\hat{p}) = \max_{p \in \mathcal{M}} L(f_q;p)$$

So, especially

$$L(f_q;\hat{p}) \geq L(f_q;q)$$

and the requirements of the M-step of a GEM algorithm are met.



4 Rolling Two Dice 

Example 2 We shall now consider an experiment in which two loaded dice are rolled, and
we shall compute the relative-frequency estimate on a corpus of outcomes.

If we assume that the two dice are distinguishable, each outcome can be represented as a
pair of numbers $(x_1, x_2)$, where $x_1$ is the number that appears on the first die and $x_2$ is the
number that appears on the second die. So, for this experiment, an appropriate set $\mathcal{X}$ of
types comprises the following 36 outcomes:





(x1,x2)    x2 = 1   x2 = 2   x2 = 3   x2 = 4   x2 = 5   x2 = 6
x1 = 1     (1,1)    (1,2)    (1,3)    (1,4)    (1,5)    (1,6)
x1 = 2     (2,1)    (2,2)    (2,3)    (2,4)    (2,5)    (2,6)
x1 = 3     (3,1)    (3,2)    (3,3)    (3,4)    (3,5)    (3,6)
x1 = 4     (4,1)    (4,2)    (4,3)    (4,4)    (4,5)    (4,6)
x1 = 5     (5,1)    (5,2)    (5,3)    (5,4)    (5,5)    (5,6)
x1 = 6     (6,1)    (6,2)    (6,3)    (6,4)    (6,5)    (6,6)

If we throw the two dice 100 000 times, then the following occurrence frequencies might
arise



f(x1,x2)   x2 = 1   x2 = 2   x2 = 3   x2 = 4   x2 = 5   x2 = 6
x1 = 1     3790     3773     1520     1498     2233     2298
x1 = 2     3735     3794     1497     1462     2269     2184
x1 = 3     4903     4956     1969     2035     2883     3010
x1 = 4     2495     2519     1026     1049     1487     1451
x1 = 5     3820     3735     1517     1498     2276     2191
x1 = 6     6369     6290     2600     2510     3685     3673



The size of this corpus is $|f| = 100\,000$. So, the relative-frequency estimate $\tilde{p}$ on $f$ can be
easily computed (see the definition of the relative-frequency estimate)

p~(x1,x2)  x2 = 1    x2 = 2    x2 = 3    x2 = 4    x2 = 5    x2 = 6
x1 = 1     0.03790   0.03773   0.01520   0.01498   0.02233   0.02298
x1 = 2     0.03735   0.03794   0.01497   0.01462   0.02269   0.02184
x1 = 3     0.04903   0.04956   0.01969   0.02035   0.02883   0.03010
x1 = 4     0.02495   0.02519   0.01026   0.01049   0.01487   0.01451
x1 = 5     0.03820   0.03735   0.01517   0.01498   0.02276   0.02191
x1 = 6     0.06369   0.06290   0.02600   0.02510   0.03685   0.03673
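The relative-frequency estimate of Example 2 is a one-line computation once the corpus is available. A minimal sketch, using the occurrence frequencies given in the example; the variable names `f`, `size` and `p_tilde` are illustrative.

```python
# Occurrence frequencies f(x1, x2) of Example 2 (rows: x1, columns: x2).
f = [
    [3790, 3773, 1520, 1498, 2233, 2298],
    [3735, 3794, 1497, 1462, 2269, 2184],
    [4903, 4956, 1969, 2035, 2883, 3010],
    [2495, 2519, 1026, 1049, 1487, 1451],
    [3820, 3735, 1517, 1498, 2276, 2191],
    [6369, 6290, 2600, 2510, 3685, 3673],
]

size = sum(sum(row) for row in f)  # corpus size |f|

# Relative-frequency estimate: each frequency divided by the corpus size.
p_tilde = [[count / size for count in row] for row in f]

print(size)
print(p_tilde[0][0], p_tilde[5][5])
```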



Example 3 We shall consider again the experiment of Example 2 in which two loaded dice are rolled, but
we shall now compute the relative-frequency estimate on the corpus of outcomes of the first
die, as well as on the corpus of outcomes of the second die.

If we look at the same corpus as in Example 2, then the corpus $f_1$ of outcomes of the first
die can be calculated as $f_1(x_1) = \sum_{x_2} f(x_1,x_2)$. An analogous summation yields the corpus of
outcomes of the second die, $f_2(x_2) = \sum_{x_1} f(x_1,x_2)$. Obviously, the sizes of all corpora are
identical, $|f_1| = |f_2| = |f| = 100\,000$. So, the relative-frequency estimates $\tilde{p}_1$ on $f_1$ and $\tilde{p}_2$ on
$f_2$ are calculated as follows



x1   f1(x1)   p~1(x1)          x2   f2(x2)   p~2(x2)
1    15112    0.15112          1    25112    0.25112
2    14941    0.14941          2    25067    0.25067
3    19756    0.19756          3    10129    0.10129
4    10027    0.10027          4    10052    0.10052
5    15037    0.15037          5    14833    0.14833
6    25127    0.25127          6    14807    0.14807
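The marginal corpora of Example 3 are row and column sums of the corpus of Example 2. A small sketch (variable names are illustrative):

```python
# Occurrence frequencies f(x1, x2) of Example 2 (rows: x1, columns: x2).
f = [
    [3790, 3773, 1520, 1498, 2233, 2298],
    [3735, 3794, 1497, 1462, 2269, 2184],
    [4903, 4956, 1969, 2035, 2883, 3010],
    [2495, 2519, 1026, 1049, 1487, 1451],
    [3820, 3735, 1517, 1498, 2276, 2191],
    [6369, 6290, 2600, 2510, 3685, 3673],
]
size = sum(sum(row) for row in f)

f1 = [sum(row) for row in f]        # f1(x1): sum over x2 (row sums)
f2 = [sum(col) for col in zip(*f)]  # f2(x2): sum over x1 (column sums)

# Relative-frequency estimates on the two marginal corpora.
p1_tilde = [count / size for count in f1]
p2_tilde = [count / size for count in f2]

print(f1, f2)
print(p1_tilde, p2_tilde)
```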



Example 4 We shall consider again the experiment of Example 2 in which two loaded dice are rolled, but
we shall now compute a maximum-likelihood estimate of the probability model which assumes
that the numbers appearing on the first and second die are statistically independent.

First, recall the definition of statistical independence (see e.g. Duda et al. (2001), page 613).

Definition 15 The variables $x_1$ and $x_2$ are said to be statistically independent given a
joint probability distribution $p$ on $\mathcal{X}$ if and only if

$$p(x_1,x_2) = p_1(x_1) \cdot p_2(x_2)$$

where $p_1$ and $p_2$ are the marginal distributions for $x_1$ and $x_2$

$$p_1(x_1) = \sum_{x_2} p(x_1,x_2) \qquad p_2(x_2) = \sum_{x_1} p(x_1,x_2)$$

So, let $\mathcal{M}_{1/2}$ be the probability model which assumes that the numbers appearing on the first
and second die are statistically independent

$$\mathcal{M}_{1/2} = \{ p \in \mathcal{M}(\mathcal{X}) \mid x_1 \text{ and } x_2 \text{ are statistically independent given } p \}$$

In Example 2 we have calculated the relative-frequency estimate $\tilde{p}$. An earlier theorem states that $\tilde{p}$
is the unique maximum-likelihood estimate of the unrestricted model $\mathcal{M}(\mathcal{X})$. Thus, $\tilde{p}$ is also a
candidate for a maximum-likelihood estimate of $\mathcal{M}_{1/2}$. Unfortunately, however, $x_1$ and $x_2$ are
not statistically independent given $\tilde{p}$ (see e.g. $\tilde{p}(1,1) = 0.03790$ and $\tilde{p}_1(1) \cdot \tilde{p}_2(1) = 0.0379493$).
This has two consequences for the experiment in which two (loaded) dice are rolled:

• the probability model, which assumes that the numbers appearing on the first and
second die are statistically independent, is a restricted model (see the definition of restricted models), and

• the relative-frequency estimate is in general not a maximum-likelihood estimate
of the standard probability model assuming that the numbers appearing on
the first and second die are statistically independent.

Therefore, we now follow the definition of maximum-likelihood estimation to compute the maximum-likelihood estimate
of $\mathcal{M}_{1/2}$. Using the independence property, the probability of the corpus $f$ allocated by an
instance $p$ of the model $\mathcal{M}_{1/2}$ can be calculated as

$$L(f;p) = \Big( \prod_{x_1=1,\ldots,6} p_1(x_1)^{f_1(x_1)} \Big) \cdot \Big( \prod_{x_2=1,\ldots,6} p_2(x_2)^{f_2(x_2)} \Big) = L(f_1;p_1) \cdot L(f_2;p_2)$$

By definition, the maximum-likelihood estimate $\hat{p}$ of $\mathcal{M}_{1/2}$ on $f$ must maximize
$L(f;p)$. A product, however, is maximized if and only if its factors are simultaneously
maximized. An earlier theorem states that the corpus probabilities $L(f_i;p_i)$ are maximized by the
relative-frequency estimates $\tilde{p}_i$. Therefore, the product of the relative-frequency estimates
$\tilde{p}_1$ and $\tilde{p}_2$ (on $f_1$ and $f_2$ respectively) might be a candidate for the maximum-likelihood
estimate $\hat{p}$ we are looking for

$$\hat{p}(x_1,x_2) = \tilde{p}_1(x_1) \cdot \tilde{p}_2(x_2)$$

Now, note that the marginal distributions of $\hat{p}$ are identical with the relative-frequency estimates
on $f_1$ and $f_2$. For example, $\hat{p}$'s marginal distribution for $x_1$ is calculated as

$$\hat{p}_1(x_1) = \sum_{x_2} \hat{p}(x_1,x_2) = \sum_{x_2} \tilde{p}_1(x_1) \cdot \tilde{p}_2(x_2) = \tilde{p}_1(x_1) \cdot \sum_{x_2} \tilde{p}_2(x_2) = \tilde{p}_1(x_1) \cdot 1 = \tilde{p}_1(x_1)$$

A similar calculation yields $\hat{p}_2(x_2) = \tilde{p}_2(x_2)$. Both equations state that $x_1$ and $x_2$ are indeed
statistically independent given $\hat{p}$

$$\hat{p}(x_1,x_2) = \hat{p}_1(x_1) \cdot \hat{p}_2(x_2)$$

So, finally, it is guaranteed that $\hat{p}$ is an instance of the probability model $\mathcal{M}_{1/2}$, as required
for a maximum-likelihood estimate of $\mathcal{M}_{1/2}$. Note: $\hat{p}$ is even a unique maximum-likelihood
estimate, since the relative-frequency estimates $\tilde{p}_i$ are unique maximum-likelihood estimates.
The relative-frequency estimates $\tilde{p}_1$ and $\tilde{p}_2$ have already been calculated in
Example 3. So, $\hat{p}$ is calculated as follows



p^(x1,x2)  x2 = 1      x2 = 2      x2 = 3      x2 = 4      x2 = 5      x2 = 6
x1 = 1     0.0379493   0.0378813   0.0153069   0.0151906   0.0224156   0.0223763
x1 = 2     0.0375198   0.0374526   0.0151337   0.0150187   0.022162    0.0221231
x1 = 3     0.0496113   0.0495224   0.0200109   0.0198587   0.0293041   0.0292527
x1 = 4     0.0251798   0.0251347   0.0101563   0.0100791   0.014873    0.014847
x1 = 5     0.0377609   0.0376932   0.015231    0.0151152   0.0223044   0.0222653
x1 = 6     0.0630989   0.0629859   0.0254511   0.0252577   0.0372709   0.0372055
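The estimate of Example 4 is the outer product of the two marginal relative-frequency estimates. The following sketch recomputes it from the corpus of Example 2 and also compares log-likelihoods: since the independence model is a subset of the unrestricted model, its best instance cannot score higher than the unrestricted relative-frequency estimate. The helper name `log_likelihood` is illustrative.

```python
import math

# Occurrence frequencies f(x1, x2) of Example 2 (rows: x1, columns: x2).
f = [
    [3790, 3773, 1520, 1498, 2233, 2298],
    [3735, 3794, 1497, 1462, 2269, 2184],
    [4903, 4956, 1969, 2035, 2883, 3010],
    [2495, 2519, 1026, 1049, 1487, 1451],
    [3820, 3735, 1517, 1498, 2276, 2191],
    [6369, 6290, 2600, 2510, 3685, 3673],
]
size = sum(sum(row) for row in f)
f1 = [sum(row) for row in f]
f2 = [sum(col) for col in zip(*f)]

# Unrestricted relative-frequency estimate and independence-model estimate.
p_tilde = [[count / size for count in row] for row in f]
p_hat = [[(f1[i] / size) * (f2[j] / size) for j in range(6)] for i in range(6)]


def log_likelihood(p):
    """log L(f; p) for a 6x6 table of probabilities p."""
    return sum(f[i][j] * math.log(p[i][j]) for i in range(6) for j in range(6))


print(round(p_hat[0][0], 7), round(p_hat[5][5], 7))
print(log_likelihood(p_tilde), log_likelihood(p_hat))
```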



Example 5 We shall consider again the experiment of Example 2 in which two loaded dice are rolled. Now,
however, we shall assume that we are incompletely informed: the corpus of outcomes (which
is given to us) consists only of the sums of the numbers which appear on the first and second
die. Nevertheless, we shall compute an estimate for a probability model on the complete data
$(x_1,x_2) \in \mathcal{X}$.

If we assume that the corpus which is given to us was calculated on the basis of the corpus
given in Example 2, then the occurrence frequency of a sum $y$ can be calculated as
$f(y) = \sum_{x_1 + x_2 = y} f(x_1,x_2)$. These numbers are displayed in the following table



y     f(y)
2     3790
3     7508
4     10217
5     10446
6     12003
7     17732
8     13923
9     8595
10    6237
11    5876
12    3673



For example, 

$$f(4) = f(1,3) + f(2,2) + f(3,1) = 1520 + 3794 + 4903 = 10217$$
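The corpus of sums can be derived mechanically from the corpus of Example 2. A short sketch (the name `f_sum` is illustrative):

```python
# Occurrence frequencies f(x1, x2) of Example 2 (rows: x1, columns: x2).
f = [
    [3790, 3773, 1520, 1498, 2233, 2298],
    [3735, 3794, 1497, 1462, 2269, 2184],
    [4903, 4956, 1969, 2035, 2883, 3010],
    [2495, 2519, 1026, 1049, 1487, 1451],
    [3820, 3735, 1517, 1498, 2276, 2191],
    [6369, 6290, 2600, 2510, 3685, 3673],
]

# f(y): add up the frequencies of all outcomes (x1, x2) with x1 + x2 = y.
f_sum = {y: 0 for y in range(2, 13)}
for x1 in range(1, 7):
    for x2 in range(1, 7):
        f_sum[x1 + x2] += f[x1 - 1][x2 - 1]

for y in sorted(f_sum):
    print(y, f_sum[y])
```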

The problem is now whether this corpus of sums can be used to calculate a good estimate
on the outcomes $(x_1,x_2)$ themselves. Hint: Examples 2 and 4 have shown that a unique
relative-frequency estimate $\tilde{p}(x_1,x_2)$ and a unique maximum-likelihood estimate $\hat{p}(x_1,x_2)$ can
be calculated on the basis of the corpus $f(x_1,x_2)$. However, right now, this corpus is not
available! Putting the example in the framework of the EM algorithm (see Definition 12),
the set of incomplete-data types is

$$\mathcal{Y} = \{2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12\}$$

whereas the set of complete-data types is $\mathcal{X}$. We also know the set of analyzes for each
incomplete-data type $y \in \mathcal{Y}$

$$\mathcal{A}(y) = \{ (x_1,x_2) \in \mathcal{X} \mid x_1 + x_2 = y \}$$

As in Example 4, we are especially interested in an estimate of the (slightly restricted)
complete-data model $\mathcal{M}_{1/2}$ which assumes that the numbers appearing on the first and
second die are statistically independent. So, for this case, a randomly generated starting instance
$p_0(x_1,x_2)$ of the complete-data model is simply the product of a randomly generated
probability distribution $p_{01}(x_1)$ for the numbers appearing on the first die, and a randomly
generated probability distribution $p_{02}(x_2)$ for the numbers appearing on the second die

$$p_0(x_1,x_2) = p_{01}(x_1) \cdot p_{02}(x_2)$$

The following tables display some randomly generated numbers for $p_{01}$ and $p_{02}$

x1   p01(x1)          x2   p02(x2)
1    0.18             1    0.22
2    0.19             2    0.23
3    0.16             3    0.13
4    0.13             4    0.16
5    0.17             5    0.14
6    0.17             6    0.12



Using the random numbers for $p_{01}(x_1)$ and $p_{02}(x_2)$, a starting instance $p_0$ of the complete-data
model $\mathcal{M}_{1/2}$ is calculated as follows

p0(x1,x2)  x2 = 1   x2 = 2   x2 = 3   x2 = 4   x2 = 5   x2 = 6
x1 = 1     0.0396   0.0414   0.0234   0.0288   0.0252   0.0216
x1 = 2     0.0418   0.0437   0.0247   0.0304   0.0266   0.0228
x1 = 3     0.0352   0.0368   0.0208   0.0256   0.0224   0.0192
x1 = 4     0.0286   0.0299   0.0169   0.0208   0.0182   0.0156
x1 = 5     0.0374   0.0391   0.0221   0.0272   0.0238   0.0204
x1 = 6     0.0374   0.0391   0.0221   0.0272   0.0238   0.0204

For example,

$$p_0(1,3) = p_{01}(1) \cdot p_{02}(3) = 0.18 \cdot 0.13 = 0.0234$$
$$p_0(2,2) = p_{01}(2) \cdot p_{02}(2) = 0.19 \cdot 0.23 = 0.0437$$
$$p_0(3,1) = p_{01}(3) \cdot p_{02}(1) = 0.16 \cdot 0.22 = 0.0352$$
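The starting instance is again an outer product, this time of the two starting distributions given in the example. A minimal sketch (variable names are illustrative):

```python
# Starting distributions of Example 5 for the first and the second die.
p01 = [0.18, 0.19, 0.16, 0.13, 0.17, 0.17]
p02 = [0.22, 0.23, 0.13, 0.16, 0.14, 0.12]

# Starting instance of the independence model: p0(x1, x2) = p01(x1) * p02(x2).
p0 = {
    (x1, x2): p01[x1 - 1] * p02[x2 - 1]
    for x1 in range(1, 7)
    for x2 in range(1, 7)
}

print(round(p0[(1, 3)], 4), round(p0[(2, 2)], 4), round(p0[(3, 1)], 4))
```

Both starting distributions assign a non-zero probability to every outcome, in line with the note in Definition 12.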



So, we are ready to start the procedure of the EM algorithm. 

First EM iteration. In the E-step, we shall compute the complete-data corpus $f_q$
expected by $q := p_0$. For this purpose, the probability of each incomplete-data type given the
starting instance $p_0$ of the complete-data model has to be computed (see Definition 12 (*))

$$p_0(y) = \sum_{x_1 + x_2 = y} p_0(x_1,x_2)$$

The numbers displayed above for $p_0(x_1,x_2)$ yield the following instance of the incomplete-data
model

y     p0(y)
2     0.0396
3     0.0832
4     0.1023
5     0.1189
6     0.1437
7     0.1672
8     0.1272
9     0.0867
10    0.0666
11    0.0442
12    0.0204

For example,

$$p_0(4) = p_0(1,3) + p_0(2,2) + p_0(3,1) = 0.0234 + 0.0437 + 0.0352 = 0.1023$$



So, the complete-data corpus expected by $q := p_0$ is calculated as follows (see line (3) of the
EM procedure given in Definition 13)

fq(x1,x2)  x2 = 1    x2 = 2    x2 = 3    x2 = 4    x2 = 5    x2 = 6
x1 = 1     3790      3735.95   2337.03   2530.23   2104.91   2290.74
x1 = 2     3772.05   4364.45   2170.03   2539.26   2821      2495.63
x1 = 3     3515.53   3233.08   1737.39   2714.95   2451.85   1903.39
x1 = 4     2512.66   2497.49   1792.29   2276.72   1804.26   1460.92
x1 = 5     3123.95   4146.66   2419.01   2696.47   2228.84   2712
x1 = 6     3966.37   4279.79   2190.88   2547.24   3164      3673

For example,

$$f_q(1,3) = f(4) \cdot \frac{p_0(1,3)}{p_0(4)} = 10217 \cdot \frac{0.0234}{0.1023} = 2337.03$$
$$f_q(2,2) = f(4) \cdot \frac{p_0(2,2)}{p_0(4)} = 10217 \cdot \frac{0.0437}{0.1023} = 4364.45$$
$$f_q(3,1) = f(4) \cdot \frac{p_0(3,1)}{p_0(4)} = 10217 \cdot \frac{0.0352}{0.1023} = 3515.53$$

(The frequency $f(4)$ of the dice sum 4 is distributed to its analyzes (1,3), (2,2), and (3,1),
simply in proportion to the current probabilities $q = p_0$ of the analyzes...)
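The E-step of the first iteration can be reproduced with the corpus of sums and the starting distributions of this example. A sketch under the assumption that dictionaries keyed by outcome pairs are an acceptable representation; the names `q`, `q_sum` and `f_q` are illustrative.

```python
from itertools import product

# Corpus of dice sums f(y) and starting distributions of Example 5.
f = {2: 3790, 3: 7508, 4: 10217, 5: 10446, 6: 12003, 7: 17732,
     8: 13923, 9: 8595, 10: 6237, 11: 5876, 12: 3673}
p01 = [0.18, 0.19, 0.16, 0.13, 0.17, 0.17]
p02 = [0.22, 0.23, 0.13, 0.16, 0.14, 0.12]

outcomes = list(product(range(1, 7), repeat=2))
q = {(a, b): p01[a - 1] * p02[b - 1] for a, b in outcomes}

# q(y): probability of each sum under the current instance q = p0.
q_sum = {y: 0.0 for y in f}
for (a, b), prob in q.items():
    q_sum[a + b] += prob

# E-step: f_q(x1, x2) = f(y) * q(x1, x2) / q(y) with y = x1 + x2.
f_q = {(a, b): f[a + b] * q[(a, b)] / q_sum[a + b] for a, b in outcomes}

print(round(q_sum[4], 4))
print(round(f_q[(1, 3)], 2), round(f_q[(2, 2)], 2), round(f_q[(3, 1)], 2))
```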



In the M-step, we shall compute a maximum-likelihood estimate $p_1 := \hat{p}$ of the complete-data
model $\mathcal{M}_{1/2}$ on the complete-data corpus $f_q$. This can be done along the lines of Examples 3
and 4. Note: This is more or less the trick of the EM algorithm! If it appears to be difficult
to compute a maximum-likelihood estimate of an incomplete-data model, then the EM algorithm
might solve your problem. It performs a sequence of maximum-likelihood estimations on
complete-data corpora. These corpora contain in general more complex data, but nevertheless,
it might be well known how one has to deal with this data! In detail: On the basis of the
complete-data corpus $f_q$ (where currently $q = p_0$), the corpus $f_{q1}$ of outcomes of the first die
is calculated as $f_{q1}(x_1) = \sum_{x_2} f_q(x_1,x_2)$, whereas the corpus of outcomes of the second die is
calculated as $f_{q2}(x_2) = \sum_{x_1} f_q(x_1,x_2)$. The following tables display them:





x1   fq1(x1)           x2   fq2(x2)
1    16788.86          1    20680.56
2    18162.42          2    22257.42
3    15556.19          3    12646.63
4    12344.34          4    15304.87
5    17326.93          5    14574.86
6    19821.28          6    14535.68



For example,

$$f_{q1}(1) = f_q(1,1) + f_q(1,2) + f_q(1,3) + f_q(1,4) + f_q(1,5) + f_q(1,6)$$
$$= 3790 + 3735.95 + 2337.03 + 2530.23 + 2104.91 + 2290.74 = 16788.86$$

$$f_{q2}(1) = f_q(1,1) + f_q(2,1) + f_q(3,1) + f_q(4,1) + f_q(5,1) + f_q(6,1)$$
$$= 3790 + 3772.05 + 3515.53 + 2512.66 + 3123.95 + 3966.37 = 20680.56$$



The sizes of both corpora are still $|f_{q1}| = |f_{q2}| = |f| = 100\,000$, resulting in the following
relative-frequency estimates ($p_{11}$ on $f_{q1}$ and $p_{12}$ on $f_{q2}$, respectively)

x1   p11(x1)           x2   p12(x2)
1    0.167889          1    0.206806
2    0.181624          2    0.222574
3    0.155562          3    0.126466
4    0.123443          4    0.153049
5    0.173269          5    0.145749
6    0.198213          6    0.145357



So, the following instance is the maximum-likelihood estimate of the model $\mathcal{M}_{1/2}$ on $f_q$

p1(x1,x2)  x2 = 1      x2 = 2      x2 = 3      x2 = 4      x2 = 5      x2 = 6
x1 = 1     0.0347204   0.0373677   0.0212322   0.0256952   0.0244696   0.0244038
x1 = 2     0.0375609   0.0404247   0.0229692   0.0277973   0.0264715   0.0264003
x1 = 3     0.0321711   0.034624    0.0196733   0.0238086   0.022673    0.022612
x1 = 4     0.0255287   0.0274752   0.0156113   0.0188928   0.0179917   0.0179433
x1 = 5     0.035833    0.0385651   0.0219126   0.0265186   0.0252538   0.0251858
x1 = 6     0.0409916   0.044117    0.0250672   0.0303363   0.0288893   0.0288116

For example,

$$p_1(1,1) = p_{11}(1) \cdot p_{12}(1) = 0.167889 \cdot 0.206806 = 0.0347204$$
$$p_1(1,2) = p_{11}(1) \cdot p_{12}(2) = 0.167889 \cdot 0.222574 = 0.0373677$$
$$p_1(2,1) = p_{11}(2) \cdot p_{12}(1) = 0.181624 \cdot 0.206806 = 0.0375609$$
$$p_1(2,2) = p_{11}(2) \cdot p_{12}(2) = 0.181624 \cdot 0.222574 = 0.0404247$$
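The M-step for the independence model reduces to two marginal relative-frequency estimates on the expected corpus. The sketch below repeats the E-step so that it is self-contained and then performs the M-step; the function names `e_step` and `m_step` are illustrative, not taken from the text.

```python
from itertools import product

# Corpus of dice sums f(y) and starting distributions of Example 5.
f = {2: 3790, 3: 7508, 4: 10217, 5: 10446, 6: 12003, 7: 17732,
     8: 13923, 9: 8595, 10: 6237, 11: 5876, 12: 3673}
p01 = [0.18, 0.19, 0.16, 0.13, 0.17, 0.17]
p02 = [0.22, 0.23, 0.13, 0.16, 0.14, 0.12]
outcomes = list(product(range(1, 7), repeat=2))


def e_step(p1, p2):
    """Expected complete-data corpus f_q for q(x1, x2) = p1(x1) * p2(x2)."""
    q = {(a, b): p1[a - 1] * p2[b - 1] for a, b in outcomes}
    q_sum = {y: 0.0 for y in f}
    for (a, b), prob in q.items():
        q_sum[a + b] += prob
    return {(a, b): f[a + b] * q[(a, b)] / q_sum[a + b] for a, b in outcomes}


def m_step(f_q):
    """Marginal relative-frequency estimates on f_q (independence model)."""
    size = sum(f_q.values())
    p1 = [sum(f_q[(a, b)] for b in range(1, 7)) / size for a in range(1, 7)]
    p2 = [sum(f_q[(a, b)] for a in range(1, 7)) / size for b in range(1, 7)]
    return p1, p2


p11, p12 = m_step(e_step(p01, p02))
print([round(v, 6) for v in p11])
print([round(v, 6) for v in p12])
print(round(p11[0] * p12[0], 7))  # p1(1, 1)
```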



So, we are ready for the second EM iteration, where an estimate $p_2$ is calculated. If we continue
in this manner, we will finally arrive at the
1584th EM iteration. The estimate which is calculated here is

x1   p1584,1(x1)       x2   p1584,2(x2)
1    0.158396          1    0.239281
2    0.141282          2    0.260559
3    0.204291          3    0.104026
4    0.0785532         4    0.111957
5    0.172207          5    0.134419
6    0.24527           6    0.149758


yielding

p1584(x1,x2)  x2 = 1      x2 = 2      x2 = 3       x2 = 4       x2 = 5      x2 = 6
x1 = 1        0.0379012   0.0412715   0.0164773    0.0177336    0.0212914   0.0237211
x1 = 2        0.0338061   0.0368123   0.014697     0.0158175    0.018991    0.0211581
x1 = 3        0.048883    0.0532299   0.0212516    0.0228718    0.0274606   0.0305942
x1 = 4        0.0187963   0.0204678   0.00817158   0.00879459   0.0105591   0.011764
x1 = 5        0.0412059   0.0448701   0.017914     0.0192798    0.0231479   0.0257894
x1 = 6        0.0586885   0.0639074   0.0255145    0.0274597    0.032969    0.0367312



In this example, more EM iterations will result in exactly the same re-estimates. So, this is
a strong reason to quit the EM procedure. Comparing $p_{1584,1}$ and $p_{1584,2}$ with the results of
Example 3 (where we have assumed that a complete-data corpus is given to us!), we see
that the EM algorithm yields pretty similar estimates.
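The whole procedure of Example 5 fits into a short loop. The sketch below runs a fixed number of iterations (200 is an arbitrary choice for illustration, not the 1584 iterations reported above, and no claim is made that it reproduces the final table) and records the log-likelihood of the corpus of sums after each iteration, which by Theorem 5 should never decrease. All function names are illustrative.

```python
import math
from itertools import product

# Corpus of dice sums f(y) and starting distributions of Example 5.
f = {2: 3790, 3: 7508, 4: 10217, 5: 10446, 6: 12003, 7: 17732,
     8: 13923, 9: 8595, 10: 6237, 11: 5876, 12: 3673}
p01 = [0.18, 0.19, 0.16, 0.13, 0.17, 0.17]
p02 = [0.22, 0.23, 0.13, 0.16, 0.14, 0.12]
outcomes = list(product(range(1, 7), repeat=2))
size = sum(f.values())


def sum_probabilities(p1, p2):
    """p(y) for every dice sum y under p(x1, x2) = p1(x1) * p2(x2)."""
    p_y = {y: 0.0 for y in f}
    for a, b in outcomes:
        p_y[a + b] += p1[a - 1] * p2[b - 1]
    return p_y


def em_iteration(p1, p2):
    """One E-step followed by one M-step for the independence model."""
    p_y = sum_probabilities(p1, p2)
    f_q = {(a, b): f[a + b] * p1[a - 1] * p2[b - 1] / p_y[a + b]
           for a, b in outcomes}
    new_p1 = [sum(f_q[(a, b)] for b in range(1, 7)) / size for a in range(1, 7)]
    new_p2 = [sum(f_q[(a, b)] for a in range(1, 7)) / size for b in range(1, 7)]
    return new_p1, new_p2


def log_likelihood(p1, p2):
    """log L(f; p) of the incomplete-data corpus of sums."""
    p_y = sum_probabilities(p1, p2)
    return sum(freq * math.log(p_y[y]) for y, freq in f.items())


p1, p2 = p01, p02
history = [log_likelihood(p1, p2)]
for _ in range(200):  # arbitrary iteration budget for this sketch
    p1, p2 = em_iteration(p1, p2)
    history.append(log_likelihood(p1, p2))

print(history[0], history[-1])
print([round(v, 4) for v in p1])
print([round(v, 4) for v in p2])
```

A practical implementation would typically stop once the re-estimates or the log-likelihood change by less than a chosen tolerance, rather than after a fixed number of iterations.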



5 Probabilistic Context-Free Grammars 

This section provides a more substantial example based on the context-free grammar or
CFG formalism, and it is organized as follows: First, we will give some background information
about CFGs, thereby motivating that treating CFGs as generators leads quite naturally
to the notion of a probabilistic context-free grammar (PCFG). Second, we provide some additional
background information about ambiguity resolution by probabilistic CFGs, thereby
focusing on the fact that probabilistic CFGs can resolve ambiguities if the underlying CFG
has a sufficiently high expressive power. For other cases, we point to some useful
grammar-transformation techniques. Third, we will investigate the standard probability
model of CFGs, thereby proving that this model is restricted in almost all cases of interest.
Furthermore, we will give a new formal proof that maximum-likelihood estimation of a
CFG's probability model on a corpus of trees is equal to the well-known and especially simple
treebank-training method. Finally, we will present the EM algorithm for training a (manually
written) CFG on a corpus of sentences, thereby pointing out that EM training
simply consists of an iterative sequence of treebank-training steps. Small toy examples will
accompany all proofs that are given in this section.



Background: Probabilistic Modeling of CFGs 



Being a bit sloppy (see e.g. Hopcroft and Ullman (1979) for a formal definition), a CFG
simply consists of a finite set of rules, where in turn, each rule consists of two parts being
separated by a special symbol "→", the so-called rewriting symbol. The two parts
of a rule are made up of so-called terminal and non-terminal symbols: a rule's left-hand
side simply consists of a single non-terminal symbol, whereas the right-hand side is a finite
sequence of terminal and non-terminal symbols. Finally, the set of non-terminal symbols
contains at least one so-called starting symbol. CFGs are also called phrase-structure
grammars, and the formalism is equivalent to Backus-Naur forms or BNF introduced
by Backus (1959). In computational linguistics, a CFG is usually used in two ways:

• as a generator: a device for generating sentences, or

• as a parser: a device for assigning structure to a given sentence

In the following, we will briefly discuss these two issues. First of all, note that in natural
language, words do not occur in arbitrary order. Instead, languages have constraints on word
order. The central idea underlying phrase-structure grammars is that words are organized
into phrases, i.e., groupings of words that form a unit. Phrases can be detected, for example,
by their ability (i) to stand alone (e.g. as an answer to a wh-question), (ii) to occur in
various sentence positions, or by their ability (iii) to show uniform syntactic possibilities for
expansion or substitution. As an example, here is the very first context-free grammar parse
tree presented by Chomsky (1956)



[Figure: parse tree of the sentence "the man took the book", rooted in Sentence.]



As displayed, Chomsky identified for the sentence "the man took the book" (encoded
in the leaf nodes of the parse tree) the following phrases: two noun phrases, "the man"
and "the book" (the figure displays them as NP subtrees), and one verb phrase, "took the
book" (displayed as VP subtree). The following list of sentences, where these three phrases
have been substituted or expanded, bears some evidence for Chomsky's analysis:

  he                                      it
  the man                                 the book
  the tall man                   took     the interesting book
  the very tall man                       the very interesting book
  the tall man with sad eyes              the very interesting book with 934 pages

Chomsky's parse tree is based on the following CFG:

  Sentence → NP VP
  NP → the man
  NP → the book
  VP → Verb NP
  Verb → took

The CFG's terminal symbols are {the, man, took, book}, its non-terminal symbols are
{Sentence, NP, VP, Verb}, and its starting symbol is "Sentence". Now, we come back
to the beginning of the section, where we mentioned that a CFG is usually thought of in
two ways: as a generator or as a parser. As a generator, the example CFG might produce
the following series of intermediate parse trees (only the last one will be submitted to the
generator's output):

Footnote (on terminal and non-terminal symbols): As a consequence, the terminal and non-terminal symbols of a given CFG form two finite and disjoint sets.

Footnote (on word order): Note, however, that so-called free-word-order languages (like Czech, German, or Russian) permit many
different ways of ordering the words in a sentence (without a change in meaning). Instead of word order, these
languages use case markings to indicate who did what to whom.



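[Figure: series of intermediate parse trees produced by the generator, each rooted in Sentence, starting with the bare starting symbol and ending with the full parse tree of "the man took the book".]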

Starting with the starting symbol, each of these intermediate parse trees is generated by 
applying one rule of the CFG to a suitable non-terminal leaf node of the previous parse 
tree, thereby adding the CFG rule as a local tree. The generator stops, if all leaf nodes of 
the current parse tree are terminal nodes. The whole generation process, of course, is non- 
deterministic, and this fact will lead us later on directly to probabilistic CFGs. As a parser, 
instead, the example CFG has to deal with an input sentence like 

"the man took the book" 

Usually, the parser starts processing the input sentence by assigning the words some local 
trees: 

NP Verb NP 

the man took the book 



Then, the parser tries to add more local trees, by processing all the non-terminal nodes found 
in previous steps: 

[Figure: the local trees of the previous step, with a VP node added above the Verb "took" and the following NP.]


Doing this recursively, the parser provides us with a parse tree of the input sentence: 



[Figure: the full parse tree of "the man took the book", rooted in Sentence.]



The example CFG is unambiguous for the given input sentence. Note, however, that this is
far from being the common situation. Usually, the parser stops when all parse trees of the
input sentence have been generated (and passed to the output).
Now, we demonstrate that the view of CFGs as generators leads
directly to the probabilistic context-free grammar or PCFG formalism. As we already
demonstrated for the generation process, the rules of the CFG serve as local trees that are
incrementally used to build up a full parse tree (i.e. a parse tree without any non-terminal
leaf nodes). This process, however, is non-deterministic: At most of its steps, some sort of
random choice is involved that selects one of the different CFG rules which can potentially
be appended to one of the non-terminal leaf nodes of the current parse tree^. Here is an 
example in the context of the generation process displayed above. For the CFG underlying 
Chomsky's very first parse tree, the non-terminal symbol NP is the left-hand side of two rules: 

NP → the man
NP → the book



Clearly, when using the underlying CFG as a generator, we have to select either the first
or the second rule whenever a local NP tree is to be appended to the partial-parse tree given
in the current generation step. The choice might be either fair (both rules are chosen with
probability 0.5) or unfair (the first rule is chosen, for example, with probability 0.9 and the
second one with probability 0.1). In either case, a random choice between competing rules
can be described by probability values which are directly allocated to the rules:

$$0 \le p(\text{NP} \to \text{the man}) \le 1 \quad\text{and}\quad 0 \le p(\text{NP} \to \text{the book}) \le 1$$

such that

$$p(\text{NP} \to \text{the man}) + p(\text{NP} \to \text{the book}) = 1$$
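The random choice between competing rules can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the dictionary `GRAMMAR`, the function `generate` and the 0.9 / 0.1 split (the "unfair" choice mentioned above) are names and values chosen for this sketch, not part of the original text.

```python
import random

# Example PCFG from the text; the 0.9 / 0.1 split is the "unfair" choice
# mentioned above and is used here only for illustration.
GRAMMAR = {
    "Sentence": [(("NP", "VP"), 1.0)],
    "NP": [(("the", "man"), 0.9), (("the", "book"), 0.1)],
    "VP": [(("Verb", "NP"), 1.0)],
    "Verb": [(("took",), 1.0)],
}


def generate(symbol="Sentence", rng=random):
    """Expand `symbol` depth-first, left to right, and return the generated words."""
    if symbol not in GRAMMAR:  # terminal symbol: nothing left to expand
        return [symbol]
    expansions, weights = zip(*GRAMMAR[symbol])
    # Random choice between the competing rules with the same left-hand side.
    rhs = rng.choices(expansions, weights=weights, k=1)[0]
    words = []
    for child in rhs:
        words.extend(generate(child, rng))
    return words


rng = random.Random(0)  # fixed seed, so that runs are repeatable
sample = [" ".join(generate(rng=rng)) for _ in range(2000)]
share = sample.count("the man took the book") / len(sample)
print(f"share of 'the man took the book': {share:.3f}")
```

With these rule probabilities, the printed share should lie close to 0.9 · 0.1 = 0.09, up to sampling noise.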



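Now, having these probabilities at hand, it turns out that it is even possible to predict how
often the generator will produce one or the other of the following alternative partial-parse
trees:

[Figure: two partial-parse trees rooted in Sentence, in which the first NP is expanded to "the man" (left) or to "the book" (right) and the verb is "took"]

Left tree: $p(\text{NP} \to \text{the man}) \cdot 100\%$

Right tree: $p(\text{NP} \to \text{the book}) \cdot 100\%$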



In turn, having this result at hand, we can also predict how often the generator will produce 
full-parse trees, for example, Chomsky's very first parse tree, or the parse tree of the sentence 
"the book took the book" : 

^Clearly, the final output of the generator is directly affected by the specific rule that has been selected 
by this random choice. Note also that there is another type of uncertainty in the generation process, playing, 
however, only a minor role: the specific place at which a CFG rule is to be appended does obviously not 
affect the generator's final output. So, these places can be deterministically chosen. For the generation process 
displayed above, for example, we decided to append the local trees always to the left-most non-terminal node 
of the actual partial-parse tree. 
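[Figure: two full-parse trees rooted in Sentence: Chomsky's very first parse tree for "the man took the book" (left) and the parse tree of "the book took the book" (right)]

Left tree: $p(\text{NP} \to \text{the man}) \cdot p(\text{NP} \to \text{the book}) \cdot 100\%$

Right tree: $p(\text{NP} \to \text{the book}) \cdot p(\text{NP} \to \text{the book}) \cdot 100\%$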



So, if $p(\text{NP} \to \text{the man}) = 0.9$ and $p(\text{NP} \to \text{the book}) = 0.1$, then it is nine times more likely
that the generator produces Chomsky's very first parse tree than the parse tree of "the book
took the book". In the following, we try to generalize this result a bit further. As we saw,
there are three rules in the CFG which cause no problems in terms of uncertainty. These are:

Sentence → NP VP
VP → Verb NP
Verb → took

To be more specific, we saw that these three rules were always added deterministically to
the partial-parse trees of the generation process. In terms of probability theory, determinism
is expressed by the fact that certain events occur with a probability of one. In other words,
a generator selects each of these rules with a probability of 100%, either when starting the
generation process, or when expanding a VP or a Verb non-terminal node. So, we let

$$p(\text{Sentence} \to \text{NP VP}) = 1$$
$$p(\text{VP} \to \text{Verb NP}) = 1$$
$$p(\text{Verb} \to \text{took}) = 1$$

The question is now: Have we gained something by treating also the deterministic choices as
probabilistic events? The answer is yes: A closer look at our example reveals that we can
now easily predict how often the generator will produce a specific parse tree: The likelihood
of a CFG's parse tree can be simply calculated as the product of the probabilities of all rules
occurring in the tree. For example:

[Figure: Chomsky's very first parse tree, i.e. the full-parse tree of "the man took the book"]

$$p(\text{Sentence} \to \text{NP VP}) \cdot p(\text{NP} \to \text{the man}) \cdot p(\text{VP} \to \text{Verb NP}) \cdot p(\text{Verb} \to \text{took}) \cdot p(\text{NP} \to \text{the book})$$



To wrap up, we investigated the small CFG underlying Chomsky's very first parse tree.
Motivated by the fact that a CFG can be used as a generator, we assigned each of its rules
a weight (a non-negative real number) such that the weights of all rules with the same
left-hand side sum up to one. In other words, all CFG fragments (comprising the CFG rules with
the same left-hand side) have been assigned a probability distribution, as displayed in the
following table:

CFG rule              Rule probability
Sentence → NP VP      p(Sentence → NP VP) = 1
NP → the man          p(NP → the man)      (the two NP rules
NP → the book         p(NP → the book)      summing to 1)
VP → Verb NP          p(VP → Verb NP) = 1
Verb → took           p(Verb → took) = 1

As a result, the likelihood of each of the grammar's parse trees (when using the CFG as
a generator) can be calculated by multiplying the probabilities of all rules occurring in the
tree. This observation leads directly to the standard definition of a probabilistic context-free
grammar, as well as to the definition of probabilities for parse trees.

Definition 16 A pair $\langle G, p \rangle$ consisting of a context-free grammar G and a probability
distribution $p: \mathcal{X} \to [0,1]$ on the set $\mathcal{X}$ of all finite full-parse trees of G is called a probabilistic
context-free grammar or PCFG, if for all parse trees $x \in \mathcal{X}$

$$p(x) = \prod_{r \in G} p(r)^{f_r(x)}$$

Here, $f_r(x)$ is the number of occurrences of the rule r in the tree x, and $p(r)$ is a probability
allocated to the rule r, such that for all non-terminal symbols A

$$\sum_{r \in G_A} p(r) = 1$$

where $G_A = \{\, r \in G \mid \mathrm{lhs}(r) = A \,\}$ is the grammar fragment comprising all rules with the
left-hand side A. In other words, a probabilistic context-free grammar is defined by a
context-free grammar G and some probability distributions on the grammar fragments $G_A$, thereby
inducing a probability distribution on the set of all full-parse trees.

So far, we have not checked for our example that the probabilities of all full-parse trees
sum up to one. According to Definition 16, however, this is the fundamental property
of PCFGs (and it really should be checked for every PCFG that happens to be given to us).
Obviously, the example grammar has four full-parse trees, and the sum of their probabilities
can be calculated as follows (omitting all rules with a probability of one):

$$
\begin{aligned}
p(\mathcal{X}) ={}& p(\text{NP} \to \text{the man}) \cdot p(\text{NP} \to \text{the book}) + p(\text{NP} \to \text{the book}) \cdot p(\text{NP} \to \text{the book}) \\
&+ p(\text{NP} \to \text{the man}) \cdot p(\text{NP} \to \text{the man}) + p(\text{NP} \to \text{the book}) \cdot p(\text{NP} \to \text{the man}) \\
={}& \bigl( p(\text{NP} \to \text{the man}) + p(\text{NP} \to \text{the book}) \bigr) \cdot p(\text{NP} \to \text{the book}) \\
&+ \bigl( p(\text{NP} \to \text{the man}) + p(\text{NP} \to \text{the book}) \bigr) \cdot p(\text{NP} \to \text{the man}) \\
={}& p(\text{NP} \to \text{the book}) + p(\text{NP} \to \text{the man}) \\
={}& 1
\end{aligned}
$$

For the last two equations, we use three times that p is a probability distribution on the
grammar fragment $G_{\text{NP}}$, i.e., we exploit that $p(\text{NP} \to \text{the man}) + p(\text{NP} \to \text{the book}) = 1$.
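The check above can also be carried out mechanically. The following Python sketch enumerates the four full-parse trees of the example grammar, computes each tree probability as the product of its rule probabilities, and sums them. The names `RULE_PROB`, `tree_rules` and `all_tree_probabilities`, as well as the concrete values 0.9 and 0.1, are assumptions made for this illustration.

```python
from itertools import product
from math import isclose, prod

# Rule probabilities of the example PCFG (0.9 / 0.1 is one possible choice).
RULE_PROB = {
    ("Sentence", ("NP", "VP")): 1.0,
    ("VP", ("Verb", "NP")): 1.0,
    ("Verb", ("took",)): 1.0,
    ("NP", ("the", "man")): 0.9,
    ("NP", ("the", "book")): 0.1,
}

NP_RULES = [rule for rule in RULE_PROB if rule[0] == "NP"]


def tree_rules(subject_np, object_np):
    """Rules occurring in the full-parse tree with the given subject and object NP rules."""
    return [
        ("Sentence", ("NP", "VP")),
        subject_np,
        ("VP", ("Verb", "NP")),
        ("Verb", ("took",)),
        object_np,
    ]


def tree_probability(rules):
    """p(x): the product of p(r) over all rule occurrences in the tree."""
    return prod(RULE_PROB[rule] for rule in rules)


def all_tree_probabilities():
    """Map each (subject, object) pair to the probability of its full-parse tree."""
    return {
        (" ".join(s[1]), " ".join(o[1])): tree_probability(tree_rules(s, o))
        for s, o in product(NP_RULES, repeat=2)
    }


probs = all_tree_probabilities()
total = sum(probs.values())
print(probs)
print("sum over all full-parse trees:", total)
```

Up to floating-point rounding, the four probabilities add up to one, and the tree for "the man took the book" is nine times as probable as the tree for "the book took the book".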

The following examples show that we really have to do this kind of "probabilistic grammar
checking". We present two non-standard PCFGs: The first one consists of the rules

S → NP sleeps (1.0)
S → John sleeps (0.7)
NP → John (0.3)

The second one is a well-known highly-recursive grammar (Chi and Geman 1998), and it is
given by

S → S S (q)      with 0.5 < q ≤ 1
S → a (1 − q)

What is wrong with these grammars? Well, the first grammar provides us with a probability
distribution on its full-parse trees, as can be seen here:

[S [NP John] sleeps]:  1.0 · 0.3 = 0.3
[S John sleeps]:       0.7

On each of its grammar fragments, however, the rule probabilities do not form a probability
distribution (neither on $G_{\text{S}}$ nor on $G_{\text{NP}}$). The second grammar is even worse: We do have
a probability distribution on $G_{\text{S}}$, but even so, we do not have a probability distribution on
the set of full-parse trees (because their probabilities sum to less than one^).

^This can be proven as follows: Let T be the set of all finite full-parse trees that can be generated by the
given context-free grammar. Then, it is easy to verify that $\pi := p(T)$ is a solution of the following equation

$$\pi = 1 - q + q \cdot \pi \cdot \pi$$

Here, $1 - q$ is the probability of the tree [S a], whereas $q \cdot \pi \cdot \pi$ corresponds to the forest [S [S T] [S T]],
i.e., to all trees whose root is expanded by S → S S and whose two daughters are expanded to trees from T.
It is easy to check that the derived quadratic equation has two solutions: $\pi_1 = 1$ and $\pi_2 = \frac{1-q}{q}$. Note that it is
quite natural that two solutions arise: The set of all "infinite full-parse trees" also matches our under-specified
approach of calculating $\pi$. Now, in the case of 0.5 < q ≤ 1, it turns out that the set of infinite trees is allocated
a proper probability $\pi_1 = 1$. (For the special case q = 1, this can be intuitively verified: The generator will
never touch the rule S → a, and therefore, this special PCFG produces infinite parse trees only.) As a
consequence, all finite full-parse trees are allocated the total probability $\pi_2$. In other words, $p(T) = \frac{1-q}{q} < 1$.

In a certain sense, however, we are able to repair both grammars. For example,

S → NP sleeps (0.3)
S → John sleeps (0.7)
NP → John (1.0)

is the standard-PCFG counterpart of the first grammar, whereas

S → S S (1 − q)      with 0.5 < q ≤ 1
S → a (q)

is a standard-PCFG counterpart of the second grammar: The first grammar and its counterpart provide us with
exactly the same parse-tree probabilities, while the second grammar and its counterpart produce parse-tree
probabilities which are proportional to each other. Especially for the second example, this interesting result
is a special case of an important general theorem recently proven by Nederhof and Satta (2003). Sloppily
formulated, their Theorem 7 states that: For each weighted CFG (defined on the basis of rule weights
instead of rule probabilities) there is a standard PCFG with the same symbolic backbone, such that (i) the parse-tree
probabilities (produced by the PCFG) sum to one, and (ii) the parse-tree weights (produced by the
weighted CFG) are proportional to the parse-tree probabilities (produced by the PCFG).

As a consequence, we get what we really want: Applied to ambiguity resolution, the original grammars
and their counterparts provide us with exactly the same maximum-probability parse trees.
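The fixed-point argument for the highly recursive grammar S → S S (q), S → a (1 − q) can be checked numerically. The sketch below iterates the equation π = 1 − q + q · π · π starting from zero; the function name `finite_tree_mass` and the iteration count are choices made for this illustration only.

```python
def finite_tree_mass(q, iterations=500):
    """Iterate pi <- (1 - q) + q * pi**2, starting from pi = 0.

    After n steps, pi is the total probability of all full-parse trees of
    height at most n, so the limit is the probability mass of all finite
    full-parse trees.
    """
    pi = 0.0
    for _ in range(iterations):
        pi = (1.0 - q) + q * pi * pi
    return pi


for q in (0.3, 0.6, 0.9):
    print(q, finite_tree_mass(q), (1.0 - q) / q)
```

Starting from zero, the iteration approaches the smaller of the two solutions of the quadratic equation: the value (1 − q)/q for q > 0.5, and the value 1 for q < 0.5.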
Background: Resolving Ambiguities with PCFGs



A property of most formalizations of natural language in terms of CFGs is ambiguity: the 
fact that sentences have more than one possible phrase structure (and therefore more than 
one meaning). Here are two prominent types of ambiguity: 

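Ambiguity caused by prepositional-phrase attachment:

[Figure: two parse trees of "Peter saw Mary with a telescope". Left: the PP "with a telescope" is attached to the verb phrase (VP → V NP PP). Right: the PP is attached to the noun phrase "Mary" (VP → V NP, NP → NP PP).]

Ambiguity caused by conjunctions:

[Figure: two parse trees of the noun phrase "the mother of the boy and the girl". Left: "the boy and the girl" forms a conjoined NP inside the PP. Right: "the mother of the boy" is conjoined with "the girl".]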

As usual in computational linguistics, some phrase structures have been displayed in
abbreviated form: For example, the term [NP the mother] is used as a short form for the parse
tree

[NP [DET the] [N mother]]

and the term [PP of the boy] is a placeholder for the even more complex parse tree

[PP [P of] [NP [DET the] [N boy]]]

In both examples, the ambiguity is caused by the fact that the underlying CFG contains
recursive rules, i.e., rules that can be applied an arbitrary number of times. Clearly,
the rules NP → NP CONJ NP and NP → NP PP belong to this type, since they can
be used to generate nominal phrases of arbitrary length. The rules VP → V NP and
PP → P NP, however, might also be called (indirectly) recursive, since they can generate
verbal and prepositional phrases of arbitrary length (in combination with NP → NP PP).
Besides causing ambiguity, recursion also makes it possible that two words that are generated by
the same CFG rule (i.e. which are syntactically linked) occur far apart in a sentence:

The bird with the nice brown eyes and the beautiful tail feathers catches a worm.

These types of phenomena are called non-local dependencies, and it is important to note
that non-local phenomena (which can be handled by CFGs) are beyond the scope of many
popular models that focus on modeling local dependencies (such as n-gram, Markov, and
hidden Markov models^). So, a part-of-speech tagger (based on an HMM) might have
difficulties with sentences like the one we mentioned, because it will not expect a singular
verb to occur after a plural noun.

Having this at hand, of course, the central question is now: Can PCFGs handle ambiguity?
The somewhat surprising answer is: Yes, but the symbolic backbone of the PCFG plays a
major role in solving this difficult task. To be a bit more specific, the CFG underlying the
given PCFG has to have some good properties, or the other way round, probabilistic modeling
of some "weak" CFGs may result in PCFGs which cannot resolve the CFG's ambiguities.
From a probabilistic modeler's point of view, there is really some non-trivial relation between
such tasks as "writing a formal grammar" and "modeling a probabilistic grammar". So, we
are convinced that formal-grammar writers should help probabilistic-grammar modelers, and
the other way round.

To exemplify this, we will have a closer look at the examples above, where we presented 
two common types of ambiguity. In general, a PCFG resolves ambiguity (i) by calculating 
all the full parse-trees of a given sentence (using the symbolic backbone of the CFG), and 
(ii) by allocating probabilities to all these trees (using the rule probabilities of the PCFG), 
and finally (iii) by choosing the most probable parse as the analysis of the given sentence. 
According to this procedure, we are calculating, for example 




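[Figure: the two parse trees of "Peter saw Mary with a telescope" from above, each with its probability]

Left tree (PP attached to the verb phrase):

$$p(\text{S} \to \text{NP VP}) \cdot p([\text{NP Peter}]) \cdot p(\text{VP} \to \text{V NP PP}) \cdot p(\text{V} \to \text{saw}) \cdot p([\text{NP Mary}]) \cdot p([\text{PP with a telescope}])$$

Right tree (PP attached to the noun phrase):

$$p(\text{S} \to \text{NP VP}) \cdot p([\text{NP Peter}]) \cdot p(\text{VP} \to \text{V NP}) \cdot p(\text{NP} \to \text{NP PP}) \cdot p(\text{V} \to \text{saw}) \cdot p([\text{NP Mary}]) \cdot p([\text{PP with a telescope}])$$

Here, $p([\text{NP Peter}])$, $p([\text{NP Mary}])$ and $p([\text{PP with a telescope}])$ denote the probabilities of the abbreviated sub-trees.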



^It is well-known, however, that a CFG without center-embeddings can be transformed to a regular
grammar (the symbolic backbone of a hidden Markov model).






Comparing the probabilities of these two analyses, we choose the analysis on the
left-hand side of this figure, if

$$p(\text{VP} \to \text{V NP PP}) > p(\text{VP} \to \text{V NP}) \cdot p(\text{NP} \to \text{NP PP})$$

So, in principle, the PP-attachment ambiguity encoded in this CFG can be resolved by a
probabilistic model built on top of this CFG. Moreover, it is especially nice that such a
PCFG resolves the ambiguity by looking only at those rules of the CFG which directly cause
the PP-attachment ambiguity.

So far, so good: We are able to use PCFGs in order to select between competing analyses of
a sentence. Looking at the second example (ambiguity caused by conjunctions), however, we
are faced with a serious problem: The two trees have different structures, but exactly the same
context-free rules are used for generating these different structures. As a consequence, both
trees are allocated the same probability (independently of the specific rule probabilities,
even if they have been provided by the very best estimation methods). So, any PCFG
based on the given CFG is unable to resolve the ambiguity manifested in the two trees.

Here is another problem. Using the grammar underlying our first example, the sentence
"the girl saw a bird on a tree" has the following two analyses:

[Figure: two parse trees of "the girl saw a bird on a tree". Left: the PP "on a tree" is attached to the noun phrase "a bird". Right: the PP is attached to the verb phrase.]

Comparing the probabilities of these two analyses, we choose the analysis on the
left-hand side of this figure, if

$$p(\text{VP} \to \text{V NP}) \cdot p(\text{NP} \to \text{NP PP}) > p(\text{VP} \to \text{V NP PP})$$

Relating this result to the disambiguation result for the sentence "Peter saw Mary with a
telescope", the prepositional phrases are attached in both cases either to the verbal phrase
or to the nominal phrase. Instead, it seems to be more plausible that the PP "with a
telescope" is attached to the verbal phrase, whereas the PP "on a tree" is attached to the
noun phrase.

Obviously, there is only one possible solution to this problem: We have to re-write the
given CFG in such a way that a probabilistic model will be able to assign different probabilities
to different analyses. For our last example, it is sufficient to enrich the underlying CFG with
some simple PP markup, enabling in principle that

$$p(\text{VP} \to \text{V NP PP-ON}) < p(\text{VP} \to \text{V NP}) \cdot p(\text{NP} \to \text{NP PP-ON})$$
$$p(\text{VP} \to \text{V NP PP-WITH}) > p(\text{VP} \to \text{V NP}) \cdot p(\text{NP} \to \text{NP PP-WITH})$$

Of course, other linguistically more sophisticated modifications of the original CFG (that
handle e.g. agreement information, sub-cat frames, selectional preferences, etc.) are also
welcome. Our only request is that the modified CFGs lead to PCFGs which are able to resolve
the different types of ambiguities encoded in the original CFG. Now, writing and re-writing
a formal grammar is a job that grammar writers can probably do much better than modelers
of probabilistic grammars. In the past, however, writers of formal grammars seemed to be
uninterested in this specific task, or they are still unaware of its existence. So, modelers of
PCFGs nowadays also regard it as an important part of their job to transform a given CFG
in such a way that probabilistic versions of the modified CFG are able to resolve ambiguities
of the original CFG. In recent years, a number of automatic grammar-transformation
techniques have been developed, which offer some interesting solutions to this quite complex
problem. While the work of Klein and Manning (2003) describes one of the latest approaches
to semi-automatic grammar transformation, the parent-encoding technique introduced by
Johnson (1998) is the earliest and purely automatic grammar-transformation technique: For
each local tree, the parent's category is appended to all daughter categories. Using the
example above, where we showed that conjunctions cause ambiguities, the parent-encoded trees
look as follows:





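[Figure: the two parse trees of "the mother of the boy and the girl" after parent encoding.
Left: [NP.S [NP.NP the mother] [PP.NP of [NP.PP [NP.NP the boy] [CONJ.NP and] [NP.NP the girl]]]]
Right: [NP.S [NP.NP [NP.NP the mother] [PP.NP of [NP.PP the boy]]] [CONJ.NP and] [NP.NP the girl]]]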



Clearly, parent-encoding of the original trees may result in different probabilities of the
transformed trees: In this example, we will choose the analysis on the left-hand side, if

$$p(\text{NP.S} \to \text{NP.NP PP.NP}) \cdot p(\text{NP.PP} \to \text{NP.NP CONJ.NP NP.NP}) \cdot p([\text{NP.NP the boy}])$$

is more likely than

$$p(\text{NP.NP} \to \text{NP.NP PP.NP}) \cdot p(\text{NP.S} \to \text{NP.NP CONJ.NP NP.NP}) \cdot p([\text{NP.PP the boy}])$$

As in the example before, it is again nice to see that these probabilities pinpoint
exactly those rules of the underlying grammar which have introduced the ambiguity.
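The effect of parent encoding on the conjunction example can be reproduced with a small Python sketch. Trees are written as nested tuples `(label, child, ...)` with strings as leaves; this representation, the helper names `rule_counts` and `parent_encode`, the extra category P for "of", and the root parent "S" are assumptions made for this illustration and are not taken from Johnson (1998).

```python
from collections import Counter

# "the mother of the boy and the girl": conjunction inside the PP (LEFT)
# versus conjunction at the top level (RIGHT).
LEFT = ("NP",
        ("NP", "the mother"),
        ("PP", ("P", "of"),
               ("NP", ("NP", "the boy"), ("CONJ", "and"), ("NP", "the girl"))))
RIGHT = ("NP",
         ("NP", ("NP", "the mother"),
                ("PP", ("P", "of"), ("NP", "the boy"))),
         ("CONJ", "and"),
         ("NP", "the girl"))


def rule_counts(tree):
    """Count the context-free rules (lhs, rhs) occurring in a tree."""
    label, *children = tree
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    counts = Counter({(label, rhs): 1})
    for child in children:
        if not isinstance(child, str):
            counts += rule_counts(child)
    return counts


def parent_encode(tree, parent="S"):
    """Append the parent's (original) category to every non-terminal label."""
    label, *children = tree
    encoded = [c if isinstance(c, str) else parent_encode(c, label) for c in children]
    return (f"{label}.{parent}", *encoded)


print(rule_counts(LEFT) == rule_counts(RIGHT))                                # same rules
print(rule_counts(parent_encode(LEFT)) == rule_counts(parent_encode(RIGHT)))  # different rules
```

Before the transformation both trees use exactly the same multiset of rules, so any PCFG gives them the same probability. After the transformation the rule multisets differ (for example, only the left tree contains NP.PP → NP.NP CONJ.NP NP.NP), so a PCFG over the encoded categories can prefer one tree.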

In the rest of the section, we will present the notion of treebank grammars, which can be
informally described as PCFGs that are constructed on the basis of a corpus of full-parse trees
(Charniak 1996). We will demonstrate that treebank grammars can resolve the ambiguous
sentences of the treebank (as well as similar ambiguous sentences), if the treebank mark-up
is rich enough to distinguish between the different types of ambiguities that are encoded in
the treebank.



Definition 17 For a given treebank, i.e., for a non-empty and finite corpus of full-parse
trees, the treebank grammar $\langle G, p \rangle$ is a PCFG defined by

(i) G is the context-free grammar read off from the treebank, and

(ii) p is the probability distribution on the set of full-parse trees of G, induced by the following
specific probability distributions on the grammar fragments $G_A$:

$$p(r) = \frac{f(r)}{\sum_{r' \in G_A} f(r')} \quad \text{for all } r \in G_A$$

Here, $f(r)$ is the number of times a rule $r \in G$ occurs in the treebank.

Note: Later on (see Theorem 10), we will show that each treebank grammar is the unique
maximum-likelihood estimate of G's probability model on the given treebank. So, it is in
particular guaranteed that p is a probability distribution on the set of full-parse trees of G, and
that $\langle G, p \rangle$ is a standard PCFG (see Definition 16).

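Example 6 We shall now consider a treebank given by the following 210 full-parse trees:

100 × t1: "Peter saw Mary with a telescope", with the PP "with a telescope" attached to the verb phrase (VP → V NP PP-WITH)
5 × t2: "Peter saw Mary with a telescope", with the PP attached to the noun phrase "Mary" (VP → V NP, NP → NP PP-WITH)
100 × t3: "Mary saw a bird on a tree", with the PP "on a tree" attached to the noun phrase "a bird" (VP → V NP, NP → NP PP-ON)
5 × t4: "Mary saw a bird on a tree", with the PP attached to the verb phrase (VP → V NP PP-ON)

We shall (i) generate the treebank grammar and (ii) use this treebank grammar to
resolve the ambiguities of the sentences occurring in the treebank.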

Ad (i): The following table displays the rules r of the CFG encoded in the given treebank
(thereby assuming for the sake of simplicity that the NP and PP non-terminals expand directly
to terminal symbols), the rule frequencies f(r), i.e. the number of times a rule occurs in the
treebank, as well as the rule probabilities p(r) as defined in Definition 17.

CFG rule                       Rule frequency         Rule probability
S → NP VP                      100 + 5 + 100 + 5      210/210 = 1.000
VP → V NP PP-WITH              100                    100/210 ≈ 0.476
VP → V NP PP-ON                5                      5/210 ≈ 0.024
VP → V NP                      5 + 100                105/210 = 0.500
NP → Peter                     100 + 5                105/525 = 0.200
NP → Mary                      100 + 5 + 100 + 5      210/525 = 0.400
NP → a bird                    100 + 5                105/525 = 0.200
NP → NP PP-WITH                5                      5/525 ≈ 0.010
NP → NP PP-ON                  100                    100/525 ≈ 0.190
PP-WITH → with a telescope     100 + 5                105/105 = 1.000
PP-ON → on a tree              100 + 5                105/105 = 1.000
V → saw                        100 + 5 + 100 + 5      210/210 = 1.000



Ad (ii): As we have already seen, the treebank grammar selects the full-parse tree t1 of the
sentence "Peter saw Mary with a telescope", if

$$p(\text{VP} \to \text{V NP PP-WITH}) > p(\text{VP} \to \text{V NP}) \cdot p(\text{NP} \to \text{NP PP-WITH})$$

Using the approximate probabilities for these rules, this is indeed true: 0.476 > 0.500 · 0.010.
(Note that exactly the same argument can be applied to similar but more complex sentences
like "Peter saw a bird on a tree with a telescope".) For the second sentence occurring in the
treebank, "Mary saw a bird on a tree", the treebank grammar selects the full-parse tree t3, if

$$p(\text{VP} \to \text{V NP PP-ON}) < p(\text{VP} \to \text{V NP}) \cdot p(\text{NP} \to \text{NP PP-ON})$$

Indeed, this is the case: 0.024 < 0.500 · 0.190.
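The computation of Example 6 can be retraced with a short Python sketch. Each tree is represented simply by the list of rules it contains, which is all that Definition 17 needs; this flat representation and the names `TREEBANK`, `treebank_grammar` and `tree_probability` are assumptions made for the illustration.

```python
from collections import Counter
from fractions import Fraction

# (tree frequency, rules occurring in the tree), following Example 6.
T1 = [("S", "NP VP"), ("NP", "Peter"), ("VP", "V NP PP-WITH"), ("V", "saw"),
      ("NP", "Mary"), ("PP-WITH", "with a telescope")]
T2 = [("S", "NP VP"), ("NP", "Peter"), ("VP", "V NP"), ("V", "saw"),
      ("NP", "NP PP-WITH"), ("NP", "Mary"), ("PP-WITH", "with a telescope")]
T3 = [("S", "NP VP"), ("NP", "Mary"), ("VP", "V NP"), ("V", "saw"),
      ("NP", "NP PP-ON"), ("NP", "a bird"), ("PP-ON", "on a tree")]
T4 = [("S", "NP VP"), ("NP", "Mary"), ("VP", "V NP PP-ON"), ("V", "saw"),
      ("NP", "a bird"), ("PP-ON", "on a tree")]
TREEBANK = [(100, T1), (5, T2), (100, T3), (5, T4)]


def treebank_grammar(treebank):
    """Rule probabilities p(r) = f(r) / (sum of f(r') over rules with the same lhs)."""
    freq = Counter()
    for count, rules in treebank:
        for rule in rules:
            freq[rule] += count
    lhs_total = Counter()
    for (lhs, _), f in freq.items():
        lhs_total[lhs] += f
    return {rule: Fraction(f, lhs_total[rule[0]]) for rule, f in freq.items()}


def tree_probability(p, rules):
    """Product of the rule probabilities over all rule occurrences in the tree."""
    prob = Fraction(1)
    for rule in rules:
        prob *= p[rule]
    return prob


p = treebank_grammar(TREEBANK)
for rule, value in sorted(p.items()):
    print(rule, value, round(float(value), 3))
print("t1 preferred over t2:", tree_probability(p, T1) > tree_probability(p, T2))
print("t3 preferred over t4:", tree_probability(p, T3) > tree_probability(p, T4))
```

Exact fractions are used so that the values can be compared directly with the table, e.g. 100/210 for VP → V NP PP-WITH and 5/525 for NP → NP PP-WITH.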



Maximum-Likelihood Estimation of PCFGs 

So far, we have seen that probabilistic context-free grammars can be used to resolve the
ambiguities that are caused by their underlying context-free backbone, and we noted already
that certain grammar transformations are sometimes necessary to achieve this goal. All these
application features are displayed in a "horizontal view" in Figure 7. In what follows next,
we will concentrate on the "vertical view" of this figure. To be more specific, we will focus
on the following two questions:

(i) how to characterize the probability model of a given context-free grammar, and

(ii) how to estimate an appropriate instance of the context-free grammar's
probability model, if a corpus of input data is additionally given.

The latter question is a tough one: It is true that the treebank-training method, which we
defined in the previous section more or less heuristically, leads to PCFGs that are able to
resolve ambiguities. From what we have done so far in this section, however, we have no clear
idea how the treebank-training method is related to maximum-likelihood estimation or the
EM training method. So, let us start with the first question.

[Figure 7 diagram. Horizontal view (application): an input sentence is analyzed by a probabilistic grammar, which consists of a formal grammar, a grammar transformation and an instance of a probability model; the multiple output analyses and transformed analyses are reduced to the most probable analysis. Vertical view (training): the instance of the probability model is obtained from the probability model and a corpus of data by maximum-likelihood estimation (or EM if we can't do MLE); the corpus consists either of sentences (unsupervised learning) or of analyses (supervised learning).]

Figure 7: Application and training of probabilistic context-free grammars

Definition 18 Let G be a context-free grammar, and let $\mathcal{X}$ be the set of full-parse trees of
G. Then, the probability model of G is defined by

$$\mathcal{M}_G = \left\{\, p \in \mathcal{M}(\mathcal{X}) \;\middle|\; p(x) = \prod_{r \in G} p(r)^{f_r(x)} \ \text{ with } \sum_{r \in G_A} p(r) = 1 \text{ for all grammar fragments } G_A \right\}$$

In other words, each instance p of the probability model $\mathcal{M}_G$ is associated to a probabilistic
context-free grammar $\langle G, p \rangle$ having the context-free backbone G. (See Definition 16 for the
meaning of the terms $p(r)$, $f_r(x)$ and $G_A$.)

As we have already seen, there are some non-standard PCFGs (like S → S S (0.9), S → a (0.1))
which do not induce a probability distribution on the set of full-parse trees. This gives us
a rough idea that it might be quite difficult to characterize those PCFGs $\langle G, p \rangle$ which
are associated to an instance p of the unrestricted probability model $\mathcal{M}(\mathcal{X})$. In other words,
it might be quite difficult to characterize G's probability model $\mathcal{M}_G$. For calculating a
maximum-likelihood estimate of $\mathcal{M}_G$ on a corpus $f_{\mathcal{T}}$ of full-parse trees, however, we have
to solve this task. For example, if we aim to exploit the powerful theorem on relative-frequency
estimation, we have to prove either that $\mathcal{M}_G$ equals the unrestricted probability model $\mathcal{M}(\mathcal{X})$, or that the
relative-frequency estimate on $f_{\mathcal{T}}$ is an instance of $\mathcal{M}_G$. For most context-free grammars
G, however, the following theorems show that this is too simplistic an approach to finding a
maximum-likelihood estimate of $\mathcal{M}_G$.

Theorem 8 Let G be a context-free grammar, and let $\tilde{p}$ be the relative-frequency estimate on
a non-empty and finite corpus $f_{\mathcal{T}}: \mathcal{X} \to \mathbb{R}$ of full-parse trees of G. Then $\tilde{p} \notin \mathcal{M}_G$ if

(i) G can be read off from $f_{\mathcal{T}}$, and

(ii) G has a full-parse tree $x_\infty \in \mathcal{X}$ that is not in $f_{\mathcal{T}}$.

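Proof Assume that $\tilde{p} \in \mathcal{M}_G$. In what follows, we will show that this assumption leads to a
contradiction. First, by definition of $\mathcal{M}_G$, it follows that there are some weights $0 \le \pi(r) \le 1$
such that

$$\tilde{p}(x) = \prod_{r \in G} \pi(r)^{f_r(x)} \quad \text{for all } x \in \mathcal{X}$$

We will show next that $\pi(r) > 0$ for all $r \in G$.

Assume that there is a rule $r_0 \in G$ with $\pi(r_0) = 0$. By (i), G can be read off from
$f_{\mathcal{T}}$. So, $f_{\mathcal{T}}$ contains a full-parse tree $x_0$ such that the rule $r_0$ occurs in $x_0$, i.e.,

$$f_{\mathcal{T}}(x_0) > 0 \quad\text{and}\quad f_{r_0}(x_0) > 0$$

It follows both $\tilde{p}(x_0) = \frac{f_{\mathcal{T}}(x_0)}{|f_{\mathcal{T}}|} > 0$ and $\tilde{p}(x_0) = \prod_{r \in G} \pi(r)^{f_r(x_0)} = \cdots \pi(r_0)^{f_{r_0}(x_0)} \cdots = 0$,
which is a contradiction.

Therefore

$$\tilde{p}(x) = \prod_{r \in G} \pi(r)^{f_r(x)} > 0 \quad \text{for all } x \in \mathcal{X}$$

On the other hand, by (ii), there is the full-parse tree $x_\infty \in \mathcal{X}$ which is not in $f_{\mathcal{T}}$. So,
$\tilde{p}(x_\infty) = \frac{f_{\mathcal{T}}(x_\infty)}{|f_{\mathcal{T}}|} = 0$, which is a contradiction to the last inequality q.e.d.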

Example 7 The treebank grammar described in Example 6 is a context-free grammar of the
type described in Theorem 8. The relative-frequency estimate $\tilde{p}$ on the treebank is given by:

$$\tilde{p}(t_1) = \tilde{p}(t_3) = \frac{100}{210} \quad\text{and}\quad \tilde{p}(t_2) = \tilde{p}(t_4) = \frac{5}{210}$$

All other full-parse trees of the treebank grammar are allocated a probability of zero by the
relative-frequency estimate. So, for example, $\tilde{p}(x_\infty) = 0$ for

[Figure: the full-parse tree $x_\infty$ of "Mary saw Peter with a telescope", with the PP "with a telescope" attached to the verb phrase]

As a consequence, $\tilde{p}$ cannot be an instance of the probability model of the treebank grammar:
Otherwise, $\tilde{p}$ would allocate both full-parse trees $t_1$ and $x_\infty$ exactly the same probability
(because $t_1$ and $x_\infty$ contain exactly the same rules).
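The argument of Example 7 can be made concrete with a few lines of Python. The sketch assumes that the unseen tree is the analysis of "Mary saw Peter with a telescope" with the PP attached to the verb phrase, i.e. the tree t1 with subject and object swapped; the names `T1`, `X_INF`, `relative_frequency` and `product_form` are likewise choices made for this illustration.

```python
from collections import Counter
from fractions import Fraction

# Trees are written as tuples of (lhs, rhs) rules in left-most derivation order.
T1 = (("S", "NP VP"), ("NP", "Peter"), ("VP", "V NP PP-WITH"), ("V", "saw"),
      ("NP", "Mary"), ("PP-WITH", "with a telescope"))
# Assumed unseen tree: the same rules, but subject and object are swapped.
X_INF = (("S", "NP VP"), ("NP", "Mary"), ("VP", "V NP PP-WITH"), ("V", "saw"),
         ("NP", "Peter"), ("PP-WITH", "with a telescope"))

# Tree frequencies of Example 6; t2..t4 only contribute to the corpus size here.
TREEBANK = Counter({T1: 100, "t2": 5, "t3": 100, "t4": 5})

# Rule probabilities of the treebank grammar for the rules occurring in T1 and X_INF.
WEIGHTS = {
    ("S", "NP VP"): Fraction(1),
    ("NP", "Peter"): Fraction(105, 525),
    ("NP", "Mary"): Fraction(210, 525),
    ("VP", "V NP PP-WITH"): Fraction(100, 210),
    ("V", "saw"): Fraction(1),
    ("PP-WITH", "with a telescope"): Fraction(1),
}


def relative_frequency(tree, corpus):
    """Relative-frequency estimate: tree frequency divided by corpus size."""
    return Fraction(corpus[tree], sum(corpus.values()))


def product_form(tree, weights):
    """Probability of a tree under any model of product form (Definition 18)."""
    prob = Fraction(1)
    for rule in tree:
        prob *= weights[rule]
    return prob


print(relative_frequency(T1, TREEBANK), relative_frequency(X_INF, TREEBANK))
print(product_form(T1, WEIGHTS), product_form(X_INF, WEIGHTS))
```

The relative-frequency estimate separates the two trees (100/210 versus 0), whereas every model of product form must give them the same value, because both trees contain the same rules.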

Theorem 9 For each context-free grammar G with an infinite set $\mathcal{X}$ of full-parse trees, the
probability model of G is restricted:

$$\mathcal{M}_G \ne \mathcal{M}(\mathcal{X})$$

Proof First, each context-free grammar consists of a finite number of rules. Thus it is possible
to construct a treebank such that G is encoded by the treebank. (Without loss of generality,
we are assuming here that all non-terminal symbols of G are reachable and productive.) So,
let $f_{\mathcal{T}}$ be a non-empty and finite corpus of full-parse trees of G such that G can be read off
from $f_{\mathcal{T}}$, and let $\tilde{p}$ be the relative-frequency estimate on $f_{\mathcal{T}}$. Second, using $|\mathcal{X}| = \infty$ and
$|f_{\mathcal{T}}| < \infty$, it follows that there is at least one full-parse tree $x_\infty \in \mathcal{X}$ which is not in $f_{\mathcal{T}}$. So,
the conditions of Theorem 8 are met, and we conclude that $\tilde{p} \notin \mathcal{M}_G$. On the other
hand, $\tilde{p} \in \mathcal{M}(\mathcal{X})$. So, clearly, $\mathcal{M}_G \ne \mathcal{M}(\mathcal{X})$. In other words, $\mathcal{M}_G$ is a restricted probability
model q.e.d.

After all, we might recognize that the previous results are not bad. Yes, the probability
model of a given CFG is restricted in most cases. The missing distributions, however,
are the relative-frequency estimates on each treebank encoding the given CFG. These
relative-frequency estimates lack any power of generalization: They allocate a probability of
zero to each full-parse tree that is not in the treebank. Obviously, however, we
want a probability model that can be learned by maximum-likelihood estimation on a corpus
of full-parse trees, but that is at the same time able to deal with full-parse trees not seen in
the treebank. The following theorem shows that we have already found one.

Theorem 10 Let $f_{\mathcal{T}}: \mathcal{X} \to \mathbb{R}$ be a non-empty and finite corpus of full-parse trees, and let
$\langle G, p_{\mathcal{T}} \rangle$ be the treebank grammar read off from $f_{\mathcal{T}}$. Then, $p_{\mathcal{T}}$ is a maximum-likelihood
estimate of $\mathcal{M}_G$ on $f_{\mathcal{T}}$, i.e.,

$$L(f_{\mathcal{T}}; p_{\mathcal{T}}) = \max_{p \in \mathcal{M}_G} L(f_{\mathcal{T}}; p)$$

Moreover, maximum-likelihood estimation of $\mathcal{M}_G$ on $f_{\mathcal{T}}$ yields a unique estimate.

This theorem is well-known. The following proof, however, is especially simple and (to the
best of my knowledge) was given first by Prescher (2002).

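Proof First step: We will show that for all model instances $p \in \mathcal{M}_G$

$$L(f_{\mathcal{T}}; p) = L(f; p)$$

At the right-hand side of this equation, f refers to the corpus of rules that are read off from the
treebank $f_{\mathcal{T}}$, i.e., $f(r)$ is the number of times a rule $r \in G$ occurs in the treebank. Similarly,
p refers to the probabilities $p(r)$ of the rules $r \in G$. In contrast, at the left-hand side of the
equation, p refers to the probabilities $p(x)$ of the full-parse trees $x \in \mathcal{X}$. The proof of the
equation is relatively straightforward:

$$
\begin{aligned}
L(f_{\mathcal{T}}; p) &= \prod_{x \in \mathcal{X}} p(x)^{f_{\mathcal{T}}(x)} \\
&= \prod_{x \in \mathcal{X}} \Bigl( \prod_{r \in G} p(r)^{f_r(x)} \Bigr)^{f_{\mathcal{T}}(x)} \\
&= \prod_{x \in \mathcal{X}} \prod_{r \in G} p(r)^{f_r(x) \cdot f_{\mathcal{T}}(x)} \\
&= \prod_{r \in G} \prod_{x \in \mathcal{X}} p(r)^{f_{\mathcal{T}}(x) \cdot f_r(x)} \\
&= \prod_{r \in G} p(r)^{\sum_{x \in \mathcal{X}} f_{\mathcal{T}}(x) \cdot f_r(x)} \\
&= \prod_{r \in G} p(r)^{f(r)} \\
&= L(f; p)
\end{aligned}
$$

In the 6th equation, we simply used that $f(r)$ (the number of times a specific rule occurs in
the treebank) can be calculated by summing up all the $f_r(x)$ (the number of times this rule
occurs in a specific full-parse tree $x \in \mathcal{X}$), weighted by the tree frequencies:

$$f(r) = \sum_{x \in \mathcal{X}} f_{\mathcal{T}}(x) \cdot f_r(x)$$

So, maximizing $L(f_{\mathcal{T}}; p)$ is equivalent to maximizing $L(f; p)$. Unfortunately, the theorem on
relative-frequency estimation cannot be applied to maximize the term $L(f; p)$, because the rule
probabilities $p(r)$ do not form a probability distribution on the set of all grammar rules. They do form,
however, probability distributions on the grammar fragments $G_A$. So, we have to refine our result a bit more.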



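Second step: We are showing here that

$$L(f; p) = \prod_{A} L(f_A; p)$$

Here, each $f_A$ is a corpus of rules, read off from the given treebank, thereby filtering out all
rules not having the left-hand side A. To be specific, we define

$$f_A(r) = \begin{cases} f(r) & \text{if } r \in G_A \\ 0 & \text{else} \end{cases}$$

Again, the proof is easy:

$$L(f; p) = \prod_{r \in G} p(r)^{f(r)} = \prod_{A} \prod_{r \in G_A} p(r)^{f(r)} = \prod_{A} \prod_{r \in G_A} p(r)^{f_A(r)} = \prod_{A} L(f_A; p)$$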
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Third step: Combining the first and second step, we conclude that 



L{fr;p) =l[L{fA;p) 

A 

So, maximizing $L(f_T; p)$ is equivalent to maximizing $\prod_A L(f_A; p)$. Now, a product is
maximized if all its factors are maximized. So, in what follows, we focus on how to maximize
the terms $L(f_A; p)$. First of all, the corpus $f_A$ comprises only rules with the left-hand side $A$.
So, the value of $L(f_A; p)$ depends only on the values $p(r)$ of the rules $r \in G_A$. These values,
however, form a probability distribution on $G_A$, and all these probability distributions on
$G_A$ have to be considered for maximizing $L(f_A; p)$. It follows that we have to calculate an
instance $\hat{p}_A$ of the unrestricted probability model $\mathcal{M}(G_A)$ such that

$$L(f_A; \hat{p}_A) = \max_{p \in \mathcal{M}(G_A)} L(f_A; p)$$

In other words, we have to calculate a maximum-likelihood estimate of the unrestricted
probability model $\mathcal{M}(G_A)$ on the corpus $f_A$. Fortunately, this task can be easily solved. According
to the theorem on relative-frequency estimation, the relative-frequency estimate on the corpus $f_A$
is our unique solution. This yields for the rules $r \in G_A$

$$\hat{p}_A(r) = \frac{f_A(r)}{|f_A|} = \frac{f(r)}{\sum_{r' \in G_A} f(r')}$$

Comparing these formulas to the formulas given in the definition of the treebank grammar, we
conclude that for all non-terminal symbols $A$

$$\hat{p}_A(r) = p_T(r) \quad \text{for all } r \in G_A$$

So, clearly, the treebank grammar $p_T$ is a serious candidate for a maximum-likelihood estimate
of the probability model $\mathcal{M}_G$ on $f_T$. Now, as Chi and Geman (1998) verified, the treebank
grammar $p_T$ is indeed an instance of the probability model $\mathcal{M}_G$. So, combining all the results,
it follows that

$$L(f_T; p_T) = \max_{p \in \mathcal{M}_G} L(f_T; p)$$

Finally, since all $\hat{p}_A \in \mathcal{M}(G_A)$ are unique maximum-likelihood estimates, $p_T \in \mathcal{M}_G$ is also a
unique maximum-likelihood estimate. q.e.d.
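The result of the proof translates directly into a small procedure: the treebank grammar is obtained by normalizing the rule frequencies within each grammar fragment $G_A$. The following sketch assumes that rules are encoded as pairs `(lhs, rhs)` and uses made-up rule counts; the function names are illustrative only.

```python
from collections import defaultdict
from math import log


def treebank_grammar(rule_counts):
    """Relative-frequency estimate per left-hand side: f(r) / sum of f(r') over G_A."""
    totals = defaultdict(float)
    for (lhs, _), f in rule_counts.items():
        totals[lhs] += f
    return {rule: f / totals[rule[0]] for rule, f in rule_counts.items()}


def log_likelihood(rule_counts, p):
    """log L(f; p) = sum over rules of f(r) * log p(r), skipping rules with f(r) = 0."""
    return sum(f * log(p[r]) for r, f in rule_counts.items() if f > 0)


# Hypothetical rule corpus f(r) read off from some treebank.
counts = {
    ("S", "NP VP"): 4,
    ("VP", "V NP"): 3, ("VP", "V NP PP"): 1,
    ("NP", "a bird"): 2, ("NP", "a worm"): 2,
}
p_T = treebank_grammar(counts)

# A competing instance that is uniform on every grammar fragment.
uniform = {("S", "NP VP"): 1.0,
           ("VP", "V NP"): 0.5, ("VP", "V NP PP"): 0.5,
           ("NP", "a bird"): 0.5, ("NP", "a worm"): 0.5}
```

Here `p_T` assigns 0.75 and 0.25 to the two VP rules, and `log_likelihood(counts, p_T)` exceeds `log_likelihood(counts, uniform)`, as the theorem predicts for any competing instance.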

EM Training of PCFGs 

Let us first present an overview of some theoretical work on using the EM algorithm for
training of probabilistic context-free grammars.

• From 1966 to 1972, a group around Leonard E. Baum invents the forward-backward
  algorithm for probabilistic regular grammars (or hidden Markov models) and formally
  proves that this algorithm has some good convergence properties. See, for example, the
  overview presented in Baum (1972).

• Booth and Thompson (1973) discover a (still nameless) constraint for PCFGs. For
  PCFGs fulfilling the constraint, the probabilities of all full-parse trees sum to one.

• Dempster et al. (1977) invent the EM algorithm.

• Baker (1979) invents the inside-outside algorithm (as a generalization of the forward-backward
  algorithm). The training corpus of this algorithm, however, is not allowed to
  contain more than one single sentence.

• Lari and Young (1990) generalize Baker's inside-outside algorithm (or in other words,
  they invent the modern form of the inside-outside algorithm): The training corpus of
  their algorithm may contain arbitrarily many sentences.

• Pereira and Schabes (1992) use the inside-outside algorithm for estimating a PCFG for
  English on a partially bracketed corpus.

• Sanchez and Benedi (1997) and Chi and Geman (1998) formally prove that for treebank
  grammars and grammars estimated by the EM algorithm, the probabilities of all full-parse
  trees sum to one.

• Prescher (2001) formally proves that the inside-outside algorithm is a dynamic-programming
  instance of the EM algorithm for PCFGs. As a consequence, the inside-outside
  algorithm inherits the convergence properties of the EM algorithm (no formal proofs of
  these properties have been given by Baker and Lari and Young).

• Nederhof and Satta (2003) discover that the PCFG standard-constraints ("the probabilities
  of the rules with the same left-hand side are summing to one") are dispensable.

The overview presents two interesting streams. First, it suggests that the forward-backward
algorithm is a special instance of the inside-outside algorithm, which in turn is a special
instance of the EM algorithm. Second, it appears worthwhile to reflect on our notion of a
standard probability model for context-free grammars (see the results of Booth and Thompson (1973)
and Nederhof and Satta (2003)). Surely, both topics are very interesting. Here, however,
we would like to limit ourselves and refer the interested reader to the papers mentioned above.
The rest of this paper is dedicated to the pure non-dynamic EM algorithm for PCFGs. We
would like to present its procedure and its properties, thereby motivating that the EM
algorithm can be successfully used to train a manually written grammar on a plain text corpus.

^ In our terms, the result of Nederhof and Satta (2003) can be expressed as follows. Let $G$ be a
context-free grammar, and let $\mathcal{M}_G^{\circ}$ be the probability model that disregards the PCFG
standard-constraints

$$\mathcal{M}_G^{\circ} = \Bigl\{\, p \in \mathcal{M}(\mathcal{X}) \;\Bigm|\; p(x) = \prod_{r \in G} p(r)^{f_r(x)} \,\Bigr\}$$

Obviously, it follows then that $\mathcal{M}_G \subseteq \mathcal{M}_G^{\circ}$. Exploiting their Corollary 8, however, it follows
somewhat surprisingly that both models are even equal: $\mathcal{M}_G = \mathcal{M}_G^{\circ}$. As a trivial consequence, a
maximum-likelihood estimate of the standard probability model $\mathcal{M}_G$ (on a corpus of trees or on a
corpus of sentences) is also a maximum-likelihood estimate of the probability model $\mathcal{M}_G^{\circ}$, as
well as the other way round.

As a first step, the following theorem shows that the EM algorithm is not only related to
the inside-outside algorithm, but that the EM algorithm is also strongly connected with the
treebank-training method on which we have focused so far.

Theorem 11 Let $\langle G, p_0 \rangle$ be a probabilistic context-free grammar, where $p_0$ is an arbitrary
starting instance of the probability model $\mathcal{M}_G$. Let $f : \mathcal{Y} \to \mathcal{R}$ be a non-empty and finite
corpus of sentences of $G$ (terminal strings that have at least one full-parse tree $x \in \mathcal{X}$).
Then, applied to the PCFG $\langle G, p_0 \rangle$ and the sentence corpus $f$, the EM algorithm performs
the following procedure

   (1)  for each i = 1, 2, 3, ... do
   (2)      $q := p_{i-1}$
   (3)      E-step (PCFGs): generate the treebank $f_{T_q} : \mathcal{X} \to \mathcal{R}$ defined by
                $f_{T_q}(x) := f(y) \cdot q(x \mid y)$  where  $y = \mathrm{yield}(x)$
   (4)      M-step (PCFGs): read off the treebank grammar $\langle G, p_{T_q} \rangle$
   (5)      $p_i := p_{T_q}$
   (6)  end // for each i
   (7)  print $p_0, p_1, p_2, p_3, \ldots$



Moreover, these EM re-estimates allocate the corpus $f$ a sequence of corpus probabilities that
is monotonically increasing

$$L(f; p_0) \le L(f; p_1) \le L(f; p_2) \le L(f; p_3) \le \ldots$$

Proof See Definition I13L and theorems El and EH

Here is a toy example that exemplifies how the EM algorithm is applied to PCFGs. Recall
first that we showed in an earlier example that a treebank grammar is (in principle) able to
disambiguate correctly different forms of PP attachment. Recall also that we had to introduce
some simple PP mark-up to achieve this result. Although we are choosing here a simpler
example (so that the number of EM calculations is kept small), the example shall provide
us with an intuition about why the EM algorithm is (in principle) able to learn "good"
PCFGs. Again, the toy example presents a CFG having a PP-attachment ambiguity. This
time, however, the training corpus is not made up of full-parse trees but of sentences of the
grammar. Two types shall occur in the given corpus: Whereas the sentences of the first type
are ambiguous, the sentences of the second type are not. Our simple goal is to demonstrate
that the EM algorithm is able to learn from the unambiguous sentences how to disambiguate
the ambiguous ones. It is exactly this feature that enables the EM algorithm to learn highly
ambiguous grammars from real-world corpora: Although it is almost guaranteed in practice
that sentences have on average thousands of analyses, hardly any sentence will be completely
ambiguous. Almost all sentences in a given corpus contain partly unambiguous information,
represented for example in small sub-trees that the analyses of the sentence share.
By weighting the unambiguous information "high", and at the same time, by weighting the
ambiguous information "low" (indifferently, uniformly or randomly), the EM algorithm might
output a PCFG that learned something, namely, a PCFG being able to select for almost all
sentences the single analysis that best fits the information which is hidden but nevertheless
encoded in the corpus.

Example 8 We shall consider an experiment in which a manually written CFG is estimated
by the EM algorithm. We shall assume that the following corpus of 15 sentences is given to
us

   f(y)   y
   5      y1
   10     y2

   y1 = "Mary saw a bird on a tree"
   y2 = "a bird on a tree saw a worm"

Using this text corpus, we shall compute the EM re-estimates of the following PCFG (that is
able to parse all the corpus sentences)



   S  → NP VP        (1.00)
   VP → V NP         (0.50)
   VP → V NP PP      (0.50)
   NP → NP PP        (0.25)
   NP → Mary         (0.25)
   NP → a bird       (0.25)
   NP → a worm       (0.25)
   PP → on a tree    (1.00)
   V  → saw          (1.00)



Note: As displayed, the uniform distributions on the grammar fragments ($G_S$, $G_{VP}$,
$G_{NP}$, ...) serve in this example as a starting instance $p_0 \in \mathcal{M}_G$. On the one hand, this is not
bad, since this is (or was) the common practice for EM training of probabilistic grammars.
On the other hand, the EM algorithm gives us the freedom to experiment with a huge range of
starting instances. Now, why not use the freedom that the EM algorithm grants? Indeed,
the convergence proposition of Theorem 11 permits that some starting instances of the EM
algorithm may lead to better re-estimates than other starting instances. So, clearly, if we are
experimenting with more than one single starting instance, then we can seriously hope that
our efforts are rewarded by getting a better probabilistic grammar!



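First of all, note that the first sentence is ambiguous, whereas the second is not. The following
figure displays the full-parse trees of both (given here in bracket notation).

   x1:  [S [NP Mary] [VP [V saw] [NP [NP a bird] [PP on a tree]]]]

   x2:  [S [NP Mary] [VP [V saw] [NP a bird] [PP on a tree]]]

   x3:  [S [NP [NP a bird] [PP on a tree]] [VP [V saw] [NP a worm]]]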



First EM iteration. In the E-step of the EM algorithm for PCFGs, a treebank $f_{T_q}$ has
to be generated on the basis of the starting instance $q := p_0$. The generation process is very
easy. First, we have to calculate the probabilities of the full-parse trees $x_1$, $x_2$ and $x_3$, which
in turn provide us with the probabilities of the sentences $y_1$ and $y_2$. Here are the results

   p_0(x)       x
   0.0078125    x1
   0.0312500    x2
   0.0078125    x3

   p_0(y)       y
   0.0390625    y1
   0.0078125    y2

For example, $p_0(x_1)$ and $p_0(y_1)$ are calculated as follows (using the definitions of tree and
sentence probabilities)

$$
\begin{aligned}
p_0(x_1) &= \prod_{r \in G} p(r)^{f_r(x_1)} \\
         &= p(\mathrm{S} \to \mathrm{NP\ VP}) \cdot p(\mathrm{NP} \to \mathrm{Mary}) \cdot p(\mathrm{VP} \to \mathrm{V\ NP}) \cdot p(\mathrm{V} \to \mathrm{saw}) \\
         &\qquad \cdot\; p(\mathrm{NP} \to \mathrm{NP\ PP}) \cdot p(\mathrm{NP} \to \mathrm{a\ bird}) \cdot p(\mathrm{PP} \to \mathrm{on\ a\ tree}) \\
         &= 1.00 \cdot 0.25 \cdot 0.50 \cdot 1.00 \cdot 0.25 \cdot 0.25 \cdot 1.00 \\
         &= 0.0078125
\end{aligned}
$$

$$
\begin{aligned}
p_0(y_1) &= \sum_{\mathrm{yield}(x) = y_1} p_0(x) \\
         &= p_0(x_1) + p_0(x_2) \\
         &= 0.0078125 + 0.0312500 \\
         &= 0.0390625
\end{aligned}
$$
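The two tables above can be recomputed with a few lines of code. The following sketch hard-codes the example grammar with its starting instance $p_0$ and lists, for each of the three full-parse trees, its yield and the rules it uses; the encoding of rules as `(lhs, rhs)` pairs and the function names are assumptions made for illustration.

```python
from math import prod

# Starting instance p_0 (uniform on each grammar fragment).
p0 = {
    ("S", "NP VP"): 1.00,
    ("VP", "V NP"): 0.50, ("VP", "V NP PP"): 0.50,
    ("NP", "NP PP"): 0.25, ("NP", "Mary"): 0.25,
    ("NP", "a bird"): 0.25, ("NP", "a worm"): 0.25,
    ("PP", "on a tree"): 1.00, ("V", "saw"): 1.00,
}

# Each full-parse tree: (its yield, the rules occurring in it, each exactly once).
trees = {
    "x1": ("y1", [("S", "NP VP"), ("NP", "Mary"), ("VP", "V NP"), ("V", "saw"),
                  ("NP", "NP PP"), ("NP", "a bird"), ("PP", "on a tree")]),
    "x2": ("y1", [("S", "NP VP"), ("NP", "Mary"), ("VP", "V NP PP"), ("V", "saw"),
                  ("NP", "a bird"), ("PP", "on a tree")]),
    "x3": ("y2", [("S", "NP VP"), ("NP", "NP PP"), ("NP", "a bird"), ("PP", "on a tree"),
                  ("VP", "V NP"), ("V", "saw"), ("NP", "a worm")]),
}


def tree_prob(x, p):
    """Probability of a full-parse tree: product of its rule probabilities."""
    return prod(p[r] for r in trees[x][1])


def sentence_prob(y, p):
    """Probability of a sentence: sum over all trees having this yield."""
    return sum(tree_prob(x, p) for x, (y_x, _) in trees.items() if y_x == y)


tree_probs = {x: tree_prob(x, p0) for x in trees}
sentence_probs = {y: sentence_prob(y, p0) for y in ("y1", "y2")}
```

With these definitions, `tree_probs["x1"]` is 0.0078125 and `sentence_probs["y1"]` is 0.0390625, in agreement with the tables.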



Now, using the probability distribution $q := p_0$, we have to generate the treebank $f_{T_q}$ by
distributing the frequencies of the sentences to the full-parse trees. The result is

   f_{T_q}(x)   x
   1            x1
   4            x2
   10           x3

For example, $f_{T_q}(x_1)$ is calculated as follows (see line (3) of the EM procedure in Theorem 11):

$$
\begin{aligned}
f_{T_q}(x_1) &= f(\mathrm{yield}(x_1)) \cdot q(x_1 \mid \mathrm{yield}(x_1)) \\
             &= f(y_1) \cdot q(x_1 \mid y_1) \\
             &= f(y_1) \cdot \frac{q(x_1)}{q(y_1)} \\
             &= 5 \cdot \frac{0.0078125}{0.0390625} \\
             &= 1
\end{aligned}
$$
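The E-step just described amounts to splitting each sentence frequency among the trees of that sentence, in proportion to their current probabilities. A minimal sketch, taking the tree probabilities computed above as given input (the function name `e_step` and the dictionary layout are illustrative assumptions):

```python
def e_step(sentence_freq, tree_prob, tree_yield):
    """Generate the treebank f_Tq(x) = f(y) * q(x) / q(y), where y = yield(x)."""
    # q(y): sum of the probabilities of all trees with yield y.
    sentence_prob = {}
    for x, q_x in tree_prob.items():
        y = tree_yield[x]
        sentence_prob[y] = sentence_prob.get(y, 0.0) + q_x
    return {
        x: sentence_freq[tree_yield[x]] * q_x / sentence_prob[tree_yield[x]]
        for x, q_x in tree_prob.items()
    }


f = {"y1": 5, "y2": 10}                                     # sentence corpus
q = {"x1": 0.0078125, "x2": 0.0312500, "x3": 0.0078125}     # p_0(x) from the table
yield_of = {"x1": "y1", "x2": "y1", "x3": "y2"}

treebank = e_step(f, q, yield_of)
```

The resulting `treebank` assigns the frequencies 1, 4 and 10 to `x1`, `x2` and `x3`. Note that the frequencies of the trees of one sentence always add up to the frequency of that sentence.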



In the M-step of the EM algorithm for PCFGs, we have to read off the treebank grammar
$\langle G, p_{T_q} \rangle$ from the treebank $f_{T_q}$. Here is the result

   S  → NP VP        (1.000)
   VP → V NP         (0.733 ≈ (1+10)/15)
   VP → V NP PP      (0.267 ≈ 4/15)
   NP → NP PP        (0.268 ≈ (1+10)/41)
   NP → Mary         (0.122 ≈ (1+4)/41)
   NP → a bird       (0.366 ≈ (1+4+10)/41)
   NP → a worm       (0.244 ≈ 10/41)
   PP → on a tree    (1.000)
   V  → saw          (1.000)





The probabilities of the rules of this grammar form our first EM re-estimate $p_1$. So, we are
ready for the second EM iteration. The following table displays the rules of our manually
written CFG, as well as their probabilities allocated by the different EM re-estimates

   CFG rule            p_0      p_1      p_2      p_3      p_18
   S  → NP VP          1.000    1.000    1.000    1.000    1.000
   VP → V NP           0.500    0.733    0.807    0.850    0.967
   VP → V NP PP        0.500    0.267    0.193    0.150    0.033
   NP → NP PP          0.250    0.268    0.287    0.298    0.326
   NP → Mary           0.250    0.122    0.118    0.117    0.112
   NP → a bird         0.250    0.366    0.357    0.351    0.337
   NP → a worm         0.250    0.244    0.238    0.234    0.225
   PP → on a tree      1.000    1.000    1.000    1.000    1.000
   V  → saw            1.000    1.000    1.000    1.000    1.000
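The whole procedure of Theorem 11 can be sketched compactly for this toy example. The code below is a minimal, non-dynamic illustration that enumerates the three full-parse trees explicitly (feasible only because the example is tiny); all identifiers are assumptions made for illustration, and only the first re-estimate is checked against the values derived above.

```python
from collections import defaultdict
from math import log, prod

# Starting instance p_0: uniform on each grammar fragment.
P0 = {
    ("S", "NP VP"): 1.0,
    ("VP", "V NP"): 0.5, ("VP", "V NP PP"): 0.5,
    ("NP", "NP PP"): 0.25, ("NP", "Mary"): 0.25,
    ("NP", "a bird"): 0.25, ("NP", "a worm"): 0.25,
    ("PP", "on a tree"): 1.0, ("V", "saw"): 1.0,
}
S, V, PP = ("S", "NP VP"), ("V", "saw"), ("PP", "on a tree")
MARY, BIRD, WORM = ("NP", "Mary"), ("NP", "a bird"), ("NP", "a worm")
# Each full-parse tree: (its yield, the rules it uses, each exactly once).
TREES = {
    "x1": ("y1", [S, MARY, ("VP", "V NP"), V, ("NP", "NP PP"), BIRD, PP]),
    "x2": ("y1", [S, MARY, ("VP", "V NP PP"), V, BIRD, PP]),
    "x3": ("y2", [S, ("NP", "NP PP"), BIRD, PP, ("VP", "V NP"), V, WORM]),
}
CORPUS = {"y1": 5, "y2": 10}  # sentence corpus f(y)


def tree_prob(x, p):
    return prod(p[r] for r in TREES[x][1])


def sentence_prob(y, p):
    return sum(tree_prob(x, p) for x in TREES if TREES[x][0] == y)


def e_step(q):
    """Treebank f_Tq(x) = f(y) * q(x | y) with y = yield(x)."""
    return {x: CORPUS[y] * tree_prob(x, q) / sentence_prob(y, q)
            for x, (y, _) in TREES.items()}


def m_step(treebank):
    """Treebank grammar: relative rule frequencies per left-hand side."""
    counts, totals = defaultdict(float), defaultdict(float)
    for x, freq in treebank.items():
        for rule in TREES[x][1]:
            counts[rule] += freq
            totals[rule[0]] += freq
    return {rule: counts[rule] / totals[rule[0]] for rule in P0}


def log_likelihood(p):
    """log L(f; p) of the sentence corpus."""
    return sum(f * log(sentence_prob(y, p)) for y, f in CORPUS.items())


def em(p, iterations):
    estimates = [p]
    for _ in range(iterations):
        p = m_step(e_step(p))
        estimates.append(p)
    return estimates


estimates = em(P0, 3)  # p_0, p_1, p_2, p_3
for i, p in enumerate(estimates):
    print(i, round(p[("VP", "V NP")], 3), round(log_likelihood(p), 4))
```

In this sketch, `estimates[1]` reproduces the first re-estimate (for instance 11/15 for VP → V NP and 10/41 for NP → a worm), and the printed log-likelihoods do not decrease from one iteration to the next. For realistic grammars the explicit enumeration of trees is not feasible, which is where the dynamic-programming variant (the inside-outside algorithm) comes in.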



In the 19th iteration, nothing new happens. So, we can quit the algorithm and discuss the
results. After all, of course, the most interesting thing for us is how these different PCFGs
perform, if they are applied to ambiguity resolution. So, we will have a look at the probabilities
these PCFGs allocate to the two analyses of the first sentence. We have already noted several
times that $p$ prefers $x_1$ to $x_2$, if

$$p(\mathrm{VP} \to \mathrm{V\ NP}) \cdot p(\mathrm{NP} \to \mathrm{NP\ PP}) > p(\mathrm{VP} \to \mathrm{V\ NP\ PP})$$

The following table displays the values of these terms for our re-estimated PCFGs

   p       p(VP → V NP) · p(NP → NP PP)     p(VP → V NP PP)
   p_0     0.500 · 0.250 = 0.125            0.500
   p_1     0.733 · 0.268 = 0.196            0.267
   p_2     0.807 · 0.287 = 0.232            0.193
   p_3     0.850 · 0.298 = 0.253            0.150
   p_18    0.967 · 0.326 = 0.315            0.033



So, the EM re-estimates prefer $x_1$ to $x_2$ starting with the second EM iteration, due to the fact
that the term $p(\mathrm{VP} \to \mathrm{V\ NP}) \cdot p(\mathrm{NP} \to \mathrm{NP\ PP})$ is monotonically increasing (within a range from
0.125 to 0.315), while at the same time the term $p(\mathrm{VP} \to \mathrm{V\ NP\ PP})$ is drastically monotonically
decreasing (from 0.500 to 0.033).